Real-time data holds potentially high value for business but it also comes with a perishable expiration date. If the value of this data is not realized in a certain window of time, its value is lost and the decision or action which was needed as a result never occurs. Such data comes continuously and quite quickly, therefore, we call it streaming data. Data streaming requires special attention as sensor reading changing rapidly, blip in log file, sudden price change holds immense value but only if it alerted in time.
Although there are many technologies available, still while considering streaming in a data lake it is necessary to have a well-executed data lake which offers strict rules and processes in terms of ingestion.
Here are some real time data streaming tools and technologies.
Apache Flink is a streaming data flow engine which aims to provide facilities for distributed computation over streams of data. Treating batch processes as a special case of data streaming, Flink is effective both as a batch and real-time processing framework but it puts streaming first. Flink offers a number of APIs which includes static data API like DataStream API, DataSet API for Java, Scala and Python and SQL-like query API for embedding in Java, Scala static API code. Flink also has its own machine learning library called FlinkML, its own SQL Query called MRQL as well as graph processing libraries.
Compared to Spark and Storm, Flink is more stream-oriented. It is something of a hybrid between Spark and Storm. Spark operates in batch mode. Flink also provides a highly flexible streaming window for the continuous streaming model. This ensures that both batch and the real-time streaming gets integrated into one system.
- Highly Flexible Streaming Windows for Continuous Streaming Model.
- Batch and Streaming in one system.
- Flink is integrated with many other open-source data processing ecosystems.
Apache Storm is a distributed real-time computation system. Its applications are designed as directed acyclic graphs. Storm can be used with any programming language. It is known for processing over one million tuples per second per node which is highly scalable and provides processing job guarantees. Storm is written in Clojure which is the Lisp-like functional-first programming language.
Storm is used for distributed machine learning, real-time analytics, and numerous other cases, especially with high data velocity. Storm runs on YARN and integrates with Hadoop ecosystems. Storm is a stream processing engine without batch support, a true real-time processing framework, taking in a stream as an entire ‘event’ instead of series of small batches. Storm has low latency and is well-suited to data which must be ingested as a single entity. Storm does suffer from a lack of direct YARN support. Storm is a bridge between batch processing and stream processing, which Hadoop is not natively designed to handle.
- Storm is known for processing one million 100 byte msgs/sec/node.
- It is scalable which works on parallel calculations that run across a cluster of machines.
- Storm is reliable. It guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
Kafka and Kinesis are very similar. Although Kafka is free and requires you to make it into an enterprise-class solution for your organization. But Amazon came to the rescue by offering Kinesis as an out of the box streaming data tool. Kinesis comprises of shards which Kafka calls partitions. For organizations that take advantage of real-time or near real-time access to large stores of data, Amazon Kinesis is great.
Kinesis Streams solves a variety of streaming data problems. One common use is the real-time aggregation of data which is followed by loading the aggregate data into a data warehouse. Data is put into Kinesis streams. This ensures durability and elasticity. Amazon Kinesis is a managed, scalable, cloud-based service which allows real-time processing of large data streams.
- Kinesis is all about real-time data.
- Kinesis Firehose ingests real-time data into data stores like S3, Elasticsearch or Redshift for batch analytics.
- Kinesis Analytics helps you to analyze data in real-time.
Apache Samza is another distributed stream processing framework which is tightly tied to the Apache Kafka messaging system. Samza is designed specifically to take advantage of Kafka’s unique architecture and guarantees fault tolerance, buffering and state storage.
Samza uses YARN for resource negotiation. This means that by default, a Hadoop cluster is required and Samza relies on rich features built into YARN. Samza is able to store state by using a fault-tolerant checkpointing system which is implemented as a local key-value store. Therefore, this helps Samza to offer at least one delivery guarantee, though it does not offer reliability and accuracy of recovery of the aggregated state in the event of failure. It also offers high-level abstractions which in many ways is easier to work with than primitive options provided by systems like Storm. Samza only supports JVM language which does not have the same language flexibility as Storm.
- Simple API: Samza provides a very simple callback-based “process message” API as compared to MapReduce.
- Managed state: Samza manages snapshotting and restoration of stream processor’s state.
- Fault tolerance: Samza works with YARN whenever a machine in the cluster fails in order to transparently migrate your tasks to another machine.
- Scalability: Samza is partitioned and distributed at all levels.
Kafka is a distributed publish-subscribe messaging system which integrates applications/data streams. It was originally developed at Linkedin Corporation and later became a part of Apache project. Therefore, Apache Spark is fast, scalable and reliable messaging system which is the key component in Hadoop technology stack for supporting real-time data analytics or monetization of Internet of Things (IoT) data.
Kafka can handle many terabytes of data without incurring much at all. Apache Kafka is altogether different from the traditional messaging system. It is designed as a distributed system and which is very easy to scale out.Kafka is designed to deliver three main advantages over AMQP, JMS etc.
- Highly Reliable: Kafka replicates data and it can support multiple subscribers. In the event of failure, it automatically balances consumers in the event of failure which is very much reliable in comparison to similar messaging services.
- Superbly Scalable: Kafka, which is a distributed system, is able to scale quickly and easily without incurring any downtime.
- High Performance: For both publishing and subscribing, Kafka delivers high throughput. It is capable of offering constant levels of performance even when it deals with many terabytes of stored messages.
- Durable: Kafka provides intra-cluster replication by keeping messages on the disks which make it durable messaging system.
We have plenty of options for processing within a big data system. For stream-only workloads, Storm has wide language support and therefore can deliver very low latency processing. Kafka and Kinesis are catching up fast and providing their own set of benefits. For batch-only workloads which are not time-sensitive, Hadoop MapReduce is a great choice.
For mixed kind of workloads, Spark offers high-speed batch processing and micro-batch processing for streaming. Flink is also becoming popular and is positioned as an alternative to Spark. Of course, the best fit for your situation will depend a lot on the state of the data to process, your infrastructure preference, actual business use case and what kinds of results you are interested in.
Also, don’t forget to grab some more knowledge on BI:
References: resources.zaloni, upside.tdwi, dzone, docs.aws.amazon, medium, digitalocean, syncsort, infoq,