Big Data, Fast Data, Internet of Things, Machine Learning – what is the current landscape? There are many tools – Hadoop, Kafka, Spark – to make the management of data easier and faster, but what is the right approach?
I will contend that there is no one “right approach”. Solutions are changing fast and often. This doesn’t mean that you should do nothing and wait. You should understand the tools available, your needs, and then proceed with a “big picture” approach using a platform that offers support for all the popular tools, along with the collateral business logic you will need to interact with your analytics.
In this blog I explain the current environment, existing tools, the pros and cons, and tips on how you can navigate the changing face of data analytics.
Want to hear more, in person? Join me and the Mesosphere team for a Meetup January 11 7-9pm in Playa del Rey, CA. Register here.
Hadoop Origin Story
When Google published the “Google File System” and “MapReduce” papers, it inspired a developer community to apply the concepts in what became the Hadoop project, released in 2006.
Hadoop saw a quick uptake with some rising internet giants of the era (Yahoo, Twitter, Facebook, and LinkedIn). Later, supported distributions (Cloudera, Hortonworks) led traditional enterprises to embark on Hadoop based projects.
The path to present day has not been without issues. Some might even say that Hadoop has lived a reverse “superhero origin story”, starting out with much excitement and fanfare, only to see the world move away, becoming an orphan as distros bundle Spark for use with Hadoop’s Zookeeper and HDFS components.
“I can’t find a happy Hadoop customer. It’s sort of as simple as that,” says Bob Muglia. His role as CEO of Snowflake Computing may justify some suspicion, but Bobby Johnson, who built Facebook’s Hadoop analytics stack, is also a critic: “there’s a bunch of things that people have been trying to do with it for a long time that it’s just not well suited for.” He goes on to criticize Hadoop’s complexity and performance. Source: “Hadoop Has Failed Us, Tech Experts Say”.
Some criticism may be based on unwarranted expectations, along with advances in technology. A lot has changed in the 11 years since Hadoop’s architecture was forged. Network bandwidth is higher, storage latency lower, and many use cases call for near immediate results.
Hadoop was built for batch processing of big datasets. It delivered advantages compared to traditional relational databases when dealing with unstructured data, but was subject to misapplication and disappointing results in use cases needing transaction processing and connections to legacy business logic built for relational databases. Batch processing is a poor match for interactive user facing applications.
The Changing Nature of Demand
The world is moving away from batch-oriented toward stream-oriented data processing because people want notifications and answers faster. Batch will never go away, but you can run batch jobs through stream processing, while the converse is problematic. It is very difficult to serve mobile apps, IoT, interactive gaming, and many other application types responsively through batch processing.
“Most decisions should probably be made with somewhere around 70% of the information you wish you had. If you wait for 90%, in most cases, you’re probably being slow.”
If you follow discussions about the Internet of Things, you’ve heard some stunning predictions of device counts and market sizes. Even if you are skeptical as to the exact numbers, it is fair to make these observations:
- IoT is going to generate way more data than we have today. Even a tiny device can generate a lot of data. And we will have a lot of devices.
- Tiny devices are not going to hold data for long. It will need to go somewhere, or be lost.
- Some control loops and user interaction requirements will demand low latency or be so critical (public safety or economic loss issues) that processing will need to be done quickly, and near the data origination point, rather than in a centralized cloud.
- While there are tasks that can be characterized as having a discrete lifecycle with a start and end, other activities are essentially continuous processing with never ending inputs and outputs.
- If you intend to utilize IoT inputs for business purposes and don’t take advantage of machine learning in the workflow, you will miss opportunities, be too slow, and incur unneeded costs.
The Progression of Data Analytics to Become Faster
Hadoop is a distributed computing framework. It uses the MapReduce programming paradigm to execute jobs in parallel on a multi-node cluster. The Hadoop project spawned an ecosystem of projects. Some like Zookeeper and the HDFS filesystem continue to be repurposed and used in other projects. Others such as MapReduce, have been effectively left behind by replacements with more attractive attributes.
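The MapReduce paradigm itself is easy to sketch. Below is a minimal, single-process Python illustration of the map, shuffle, and reduce phases, using the canonical word-count example (function names are mine, for illustration only); a real Hadoop job distributes these same phases across cluster nodes:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "fast data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "ideas": 1, "fast": 1}
```

In Hadoop, the shuffle step is where intermediate results are staged to HDFS on disk – the source of the performance criticism discussed below.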
Memory vs Disk
Hadoop’s I/O bandwidth, built around disk-based HDFS, was acceptable in 2006. Its use of HDFS to stage intermediate results is a processing-speed bottleneck by modern standards.
Spark is designed to do faster processing on in-memory data. It can utilize Hadoop’s HDFS where persistence is required. It is a distributed computing library – not a complete framework – designed to work as a pipeline, used with additional pluggable components to ingest, process, and store data.
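To make the pipeline idea concrete, here is a toy sketch of chained, lazy transformations over in-memory data, loosely in the spirit of Spark’s API (the `Pipeline` class is invented for illustration; this is not real PySpark):

```python
class Pipeline:
    """A toy in-memory pipeline: chained, lazy transformations over a
    dataset, loosely mimicking Spark's map/filter/collect style."""
    def __init__(self, data):
        self._data = data  # the in-memory source ("ingest")

    def map(self, fn):
        # Build a new stage lazily; nothing runs yet.
        return Pipeline(fn(x) for x in self._data)

    def filter(self, pred):
        return Pipeline(x for x in self._data if pred(x))

    def collect(self):
        # Only here is the whole chain actually evaluated ("store").
        return list(self._data)

result = (Pipeline(range(10))
          .filter(lambda x: x % 2 == 0)
          .map(lambda x: x * x)
          .collect())
# result == [0, 4, 16, 36, 64]
```

Because each stage feeds the next in memory, no intermediate results touch disk – which is the essential contrast with MapReduce’s HDFS-staged shuffle.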
Batch vs Micro-batching
The move to memory vs. disk is not the only avenue of change in data analytics.
Hadoop is limited to batch processing of a job.
Spark can be used for batch processes, as well as near real time stream processing. Spark supports “micro-batching” where batch sizes are many times smaller. (Spark does not support streaming in the strictest sense but it can provide pseudo real time streaming when used with a very small batch size.)
Source: Apache Foundation
Other open source projects such as Apache Storm do stream processing in a strict sense (processing handles a single event at a time), but can also support micro-batching. (For comparison, Spark might have a latency measured in seconds, while Storm might produce results in milliseconds.)
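The distinction can be illustrated with a small sketch (pure Python; names are invented for illustration). Micro-batching groups the stream into small batches, while strict stream processing handles one event at a time; note this toy batches by count, whereas Spark batches by a time interval:

```python
def micro_batches(events, batch_size):
    """Group an event stream into small fixed-size batches
    (micro-batching, in the style of Spark Streaming)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:           # flush the final partial batch
        yield batch

def per_event(events, handle):
    """Strict stream processing, Storm-style: one event at a time."""
    for event in events:
        handle(event)

stream = [1, 2, 3, 4, 5]
batched = list(micro_batches(stream, 2))
# batched == [[1, 2], [3, 4], [5]]
```

With per-event handling, each event can produce a result in milliseconds; with micro-batching, latency is bounded below by the batch interval.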
Dealing with Complexity
Bringing Home the “Free Puppy”
Many an enterprise analytics project got started when a developer went to a tech conference and had an “Aha! What if?” experience.
Alas, there is rarely a single data input source
A prototype looks good and an initial proof of concept succeeds. But as the project moves toward production it is scaled up with more than one data source.
And there is rarely a single use for a data input source
And other useful applications for these same inputs are identified. Some uses might accommodate batch processing. Others might demand faster processing (micro-batching or true streaming).
Suddenly the solution becomes the problem
A combinatorial explosion becomes difficult to manage – from a development and operational perspective.
Enter the message exchange model
Apache Kafka stores messages which come from arbitrarily many processes called “producers”. The data can thereby be divided into different “partitions” within different “topics”. Within a partition the messages are indexed and stored together with a timestamp. Other processes called “consumers” can query messages from partitions. Kafka runs on a cluster of one or more servers, and the partitions can be distributed across cluster nodes.
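The log model just described can be sketched in a few lines. The following is a toy, in-memory model (class and method names are invented for illustration; this is not the Kafka client API):

```python
import time
from collections import defaultdict

class ToyTopic:
    """A toy model of a Kafka topic: producers append messages to
    partitions, each message indexed by offset and timestamped.
    Illustration only -- not the real Kafka API."""
    def __init__(self, num_partitions=2):
        self.partitions = defaultdict(list)
        self.num_partitions = num_partitions

    def produce(self, key, value):
        # A producer picks a partition (here: by hashing the key,
        # so all messages with one key land in one partition, in order).
        p = hash(key) % self.num_partitions
        offset = len(self.partitions[p])
        self.partitions[p].append((offset, time.time(), value))
        return p, offset

    def consume(self, partition, from_offset=0):
        # A consumer reads a partition from a given offset onward.
        return [v for off, ts, v in self.partitions[partition]
                if off >= from_offset]

topic = ToyTopic()
p, _ = topic.produce("sensor-1", "reading A")
topic.produce("sensor-1", "reading B")
topic.consume(p)  # ["reading A", "reading B"]
```

Because consumers track their own offsets, many independent consumers can read the same partition at their own pace – the property that lets Kafka serve real-time and batch consumers from the same data.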
Apache Kafka efficiently processes real-time and streaming data when implemented alongside Apache Storm, Apache HBase, and Apache Spark. Deployed as a cluster on multiple servers, Kafka handles its entire publish-and-subscribe messaging system with the help of four APIs (Producer, Consumer, Streams, and Connect).
Kafka creator Jay Kreps was responsible for running a large Hadoop cluster at LinkedIn. Kreps and other engineers from the project left to form Confluent, a company focused on Kafka.
Source: The New Stack
Kafka is designed to be a general purpose message broker that can handle millions of rapid-fire events. It features low latency with “at-least-once” delivery of messages to consumers. Kafka also supports queuing of data for offline consumers, enabling both real-time and offline consumption – useful for batch operations and for tolerating temporary outages at the consumption layer. Performance is compatible with real-time use cases.
The progression from batch, to micro-batch, to streaming, to messaging is not likely to stop there. Messaging is suitable for many classes of application – but not all. And if there is any lesson to be observed in the history of data analytics, it is that human ingenuity will produce new and better solutions in the near future.
Kafka generally provides “at least once” delivery, but a playful joke holds that the two hardest problems in distributed systems are:
- Exactly once delivery of messages
- Preservation of order of messages
- Exactly once delivery of messages
Certain patterns can be used to throw out duplicates, but a standardized solution would be attractive for some classes of applications.
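One such pattern is the idempotent consumer: remember the IDs of messages already processed and drop redeliveries. A minimal sketch (assuming every message carries a unique ID; a production system would persist the seen-ID set, e.g. in a database, and the class name here is invented for illustration):

```python
class IdempotentConsumer:
    """Turn at-least-once delivery into effectively-once processing by
    remembering which message IDs have already been handled."""
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return False           # duplicate redelivery: drop it
        self.handler(payload)      # side effect happens once per ID
        self.seen.add(msg_id)      # mark done only after success
        return True

processed = []
consumer = IdempotentConsumer(processed.append)
consumer.receive(1, "a")
consumer.receive(1, "a")   # broker redelivers: ignored
consumer.receive(2, "b")
# processed == ["a", "b"]
```

Marking the ID as seen only after the handler succeeds preserves the at-least-once guarantee: if processing fails, the message is not marked and a redelivery will be retried.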
Application Requirements variation
Pravega is an example of recent work continuing this pattern: new open source projects attempting to supplement, or replace, the old. It is a streaming storage solution that can deliver transactional behavior with exactly-once delivery and order preservation, along with long-term data retention.
How to Choose your Analytics Solution
If you have been following along, you might conclude that you should just jump in and deploy Hadoop, Spark, and Kafka. Alas, it’s not that simple.
Any realistic production application will add more components: business logic, a user interface, and an assortment of collateral stateful backing stores and microservices. These will plug into and out of your analytics pipeline.
And if there is a lesson to be learned from the 10+ year history since the advent of Hadoop, it is that whatever is the “best” solution at this point in time will not be in a few years.
Under the Apache Foundation alone there are 38 projects under the “big data” category, perhaps 12 of which can be classified as built to handle streaming. We will define streaming as a never ending sequence of records potentially originating from multiple sources.
Tradeoffs include low vs high latency, ability to horizontally scale to handle changing volume, machine learning support, filtering and transform plugin support, and support for legacy interfaces such as SQL.
This scenario is not unlike what happened in the 90’s during the early days of the internet. Someone circa 1995 might have asked “What language should I use to build a website?” This would have been the wrong question to ask.
In the 90’s, books and articles were published with tables comparing features of Java vs. .NET, etc. These did little more than list the general properties of each platform, leaving the impression that you might be equally successful with any of them. While technically true, the world quickly moved on to more highly evolved “suites” that offered publication of your website at a much, much higher level of abstraction – with perhaps the most important feature being flourishing user communities that supported conferences, support forums, and best-practice articles and presentations.
Something very much like this is taking place now in the form of the “SMACK Stack”.
Your solution should be based on Container Orchestration
Rather than give you a table comparing features of 12+ streaming analytics tools, (likely obsolete within weeks anyway), my advice is to seek a platform which can host all the popular tools, and then choose whatever shows the most traction in the form of user experience stories in media and at tech conferences. Be prepared to see your choice leapfrogged a few years later. Expect that you will use multiple, different streaming platforms simultaneously, either because you are in transition, or have applications with varying requirements. Accept that this is a good thing. This is exactly why you need a platform that is versatile and flexible enough to host whatever comes along.
Rather than pick a winner between VHS and BetaMax, understand that a replacement by DVD, Blu-ray, etc. will come along. Base your technology choice on flexibility to change.
You also want a platform that supports whatever collateral applications and services you are likely to use. Things like TensorFlow, SQL and NoSQL datastores, and scalable user interface platforms.
The Internet of Things may lead you to require data ingestion and analytics compute capacity at edge locations, along with applications in public clouds. Bandwidth constraints, response latency needs, and resiliency can demand local processing and data reduction. Having a solution that makes applications run portably from edge to public cloud matters.
You need to retain freedom to choose what you run and where you run it. Even if you identify and deploy the best solution available today, something better will eventually come around. What you need is a flexible platform that lets you deploy not just analytics tools of your choice, but the collateral applications and services that are needed to go with it. And you want this to be simple to scale and maintain. You want to train your staff in one technology/API and reuse these skills everywhere.
Expect that your needs will change annually, seasonally and even on intervals of seconds based on changing workloads
The Apache Mesos, DC/OS, and Kubernetes platforms are the leading candidates for this role. These fall into the category of container orchestrators, and can be overlaid on top of public clouds or an on-prem data center. They are designed to deploy broad ranges of applications and services under automation.
These orchestrators handle the mapping of dynamically changing workloads to variable resources, along with the service discovery, networking, and security issues that arise in practice. Their use of container technology allows better portability and more responsive scaling compared to virtualization under hypervisors alone.
Apache Mesos and DC/OS
Most of the leading open source analytics projects are sponsored by the Apache Foundation, leading to a high correlation of user deployments on the Mesos platform. These projects publish container images as part of their release cycles, so a platform that allows you to deploy workloads at a container level of abstraction is critical.
DC/OS (based on Apache Mesos) supports “app store” like deployments of Spark, Kafka, Cassandra, etc.
Kubernetes (and DC/OS, SWARM, PKS distributions)
Kubernetes can be used in conjunction with the Helm project, StatefulSets, and Operators to automate deployment and management of a modern analytics application stack. Example: https://github.com/kubernetes/charts/tree/master/incubator/kafka
As certified Kubernetes distributions, Docker Enterprise Edition and VMware Pivotal Container Service (PKS) are also capable of supporting streaming analytics applications in a flexible fashion while maintaining portability from edge to public clouds.