Apache Storm
Distributed and fault-tolerant realtime computation
Original author(s) | Nathan Marz (BackType) |
---|---|
Developer(s) | Apache Software Foundation |
Initial release | September 17, 2011[1] |
Stable release | 2.8.0 / 25 January 2025[2] |
Repository | Storm Repository |
Written in | Java (formerly Clojure) |
Operating system | Cross-platform |
Platform | Java VM (Java 17+) |
Type | Stream processing, Distributed computing |
License | Apache License 2.0 |
Website | storm |
Apache Storm is a distributed real-time computation system for processing large streams of data.[3] Originally created by Nathan Marz and his team at BackType,[4] the project was open-sourced after Twitter acquired BackType in 2011.[5] Storm processes unbounded streams of data reliably, quickly, and at scale, making it suitable for real-time analytics, machine learning, continuous computation, distributed RPC, and ETL. As of 2025, Storm remains actively maintained and deployed in production environments, though it faces competition from newer stream processing frameworks that have built upon its foundational concepts.[6]
Overview
Apache Storm is designed to process unbounded streams of data reliably at scale. Unlike batch processing systems such as Apache Hadoop, Storm processes data continuously as it arrives, enabling sub-second latencies.[7] The framework emphasizes:
- Real-time processing: Data is processed immediately upon arrival
- Scalability: Ability to scale horizontally to thousands of nodes
- Fault tolerance: Automatic recovery from failures with message replay
- Guaranteed message processing: At-least-once and exactly-once semantics
- Language agnostic: Topologies are defined as Thrift structures, and components can be written in non-JVM languages through a JSON-based multi-language protocol
According to independent benchmarks, Storm can process over one million small messages per second on modest hardware, though actual performance depends heavily on use case and message complexity.[8]
History
Origins at BackType
Apache Storm was originally developed at BackType, a social analytics company founded by Nathan Marz and Michael Montano. The team created Storm to handle the massive streams of social media data that BackType processed for its analytics products.[9] Storm was designed to address limitations in existing real-time processing systems, providing a simple programming model similar to MapReduce but for streaming data.[10]
Acquisition by Twitter
In July 2011, Twitter acquired BackType, bringing Storm's development in-house.[11] Twitter open-sourced Storm on September 19, 2011, at the Strange Loop conference, making it available to the broader community.[5] This move was part of Twitter's strategy to share its infrastructure technologies with the open-source community, similar to its approach with other projects such as Bootstrap and Finagle.[12]
Apache incubation and graduation
In September 2013, Storm entered the Apache Incubator, beginning its transition to the Apache Software Foundation.[13] After a year of incubation, Storm graduated to become an Apache Top-Level Project in September 2014, demonstrating its maturity and sustainable community.[14]
Major releases and evolution
Version 1.0 (April 2016): Marked API stability and production readiness, introducing significant performance improvements along with features such as a native windowing API, stateful bolts with checkpointing, and automatic backpressure.[15]
Version 2.0 (May 2019): A major architectural overhaul that reimplemented Storm's core in Java (from Clojure), resulting in significant performance improvements and making the codebase more accessible to contributors.[16] Industry analysts noted this as a crucial step in maintaining Storm's relevance in the evolving stream processing landscape.[17]
Recent versions (2023-2025):
- Version 2.6.0 (November 2023): Dropped Java 8 support, removed deprecated components[18]
- Version 2.7.1 (November 2024): Last release supporting Java 11[19]
- Version 2.8.0 (January 2025): Current release requiring Java 17+[2]
Architecture
Core concepts
Apache Storm's architecture is built around several key abstractions:[20]
Topology: The logical structure of a Storm application, represented as a directed acyclic graph (DAG) where data flows from node to node. Topologies run continuously until explicitly killed, processing data as it arrives.
Spouts: Sources of streams in a topology. Spouts read data from external sources (e.g., Kafka, databases, APIs) and emit tuples (ordered lists of data elements) into the topology.
Bolts: Processing units that consume input streams, perform computations, and optionally emit new streams. Bolts can perform any processing including filtering, aggregations, joins, and database interactions.
Streams: Unbounded sequences of tuples flowing between spouts and bolts. Each stream has a schema defining the fields contained in its tuples.
Tuples: The basic unit of data in Storm, representing an immutable ordered list of values (similar to a database row or a list in programming).
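The relationship between these abstractions can be sketched in plain Java. The example below is a minimal model that assumes a finite input and runs in a single thread; it does not use the Storm API, and the names `MiniTopology`, `sentenceSpout`, `splitBolt`, and `countBolt` are hypothetical. A real topology would run continuously over an unbounded stream and distribute each stage across many tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a spout -> bolt -> bolt pipeline; NOT the Storm API.
class MiniTopology {

    // "Spout": the source of the stream. A tuple is modelled as an
    // immutable ordered list of values (here, a single "sentence" field).
    static List<List<Object>> sentenceSpout() {
        return List.of(
            List.<Object>of("the cow jumped"),
            List.<Object>of("the moon"));
    }

    // "Split bolt": consumes sentence tuples and emits one word tuple per word.
    static List<List<Object>> splitBolt(List<List<Object>> input) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : input) {
            for (String word : ((String) tuple.get(0)).split(" ")) {
                out.add(List.<Object>of(word));
            }
        }
        return out;
    }

    // "Count bolt": consumes word tuples and keeps a running count per word.
    static Map<String, Integer> countBolt(List<List<Object>> input) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<Object> tuple : input) {
            counts.merge((String) tuple.get(0), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Wire the stages together in the order the data flows.
        System.out.println(countBolt(splitBolt(sentenceSpout())));
    }
}
```

Running `main` prints `{cow=1, jumped=1, moon=1, the=2}`. The chained calls stand in for the edges of the topology graph; in Storm itself the wiring is declared once and the framework moves tuples between components.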
Cluster architecture
A Storm cluster follows a master-worker architecture:[21][22]
Nimbus (Master Node):
- Distributes code across the cluster
- Assigns tasks to worker nodes
- Monitors for failures and reassigns tasks
- Similar to Hadoop's JobTracker but stateless
Supervisor (Worker Nodes):
- Listens for work assigned by Nimbus
- Starts and stops worker processes
- Each worker node runs a Supervisor daemon
ZooKeeper:
- Provides distributed coordination between Nimbus and Supervisors
- Stores cluster configuration and state
- Enables Nimbus high availability through leader election
Worker Processes:
- Execute a subset of a topology
- Run one or more executors (threads)
- Communicate with other workers for tuple routing
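The division of labour described above, in which Nimbus assigns work and worker processes run executors, can be illustrated with a toy placement routine. This is only a sketch under simplifying assumptions: it uses plain round-robin placement, which is not Storm's actual scheduling logic, and the class `ToyAssignment` and the executor and worker names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy round-robin placement of executors onto worker processes.
// Illustration only; this is not Storm's scheduler implementation.
class ToyAssignment {

    // Returns a plan mapping each worker to the executors it should run.
    static Map<String, List<String>> assign(List<String> executors, List<String> workers) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String worker : workers) {
            plan.put(worker, new ArrayList<>());
        }
        // Hand out executors one at a time, cycling through the workers.
        for (int i = 0; i < executors.size(); i++) {
            String worker = workers.get(i % workers.size());
            plan.get(worker).add(executors.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Hypothetical executor and worker names.
        List<String> executors = List.of("spout-1", "split-1", "split-2", "count-1", "count-2");
        List<String> workers = List.of("supervisorA:6700", "supervisorB:6700");
        System.out.println(assign(executors, workers));
    }
}
```

Calling `assign` again with a shorter worker list models, very roughly, the reassignment that follows a worker failure: the same executors are spread over whichever workers remain.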
Processing guarantees
Storm provides configurable levels of message processing guarantees:[23]
- At-most-once: Messages may be lost but are never reprocessed (fastest, lowest overhead)
- At-least-once: Messages are never lost but may be reprocessed (default guarantee)
- Exactly-once: Each message is processed exactly once (available via Trident API, higher latency)
The framework achieves these guarantees through a tuple acknowledgment system that tracks the complete processing of each tuple and its derivatives through the entire topology DAG.
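Storm's documentation describes the acknowledgment tracking as an XOR over 64-bit tuple identifiers: each identifier is XORed into a per-spout-tuple value once when the tuple is created and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked. The following is a simplified single-threaded model of that idea, not Storm's acker code; the class `AckTracker` and its method names are hypothetical, and small fixed identifiers replace the random 64-bit values a real system would use.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of XOR-based tuple-tree tracking; not Storm's acker code.
class AckTracker {
    // One 64-bit running XOR value per spout tuple, keyed by its root id.
    private final Map<Long, Long> pending = new HashMap<>();

    // Record that tupleId was created in the tree rooted at rootId.
    void emitted(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // Record that tupleId was acked. Returns true once every tuple in the
    // tree has been both emitted and acked (the XOR value is back to zero).
    boolean acked(long rootId, long tupleId) {
        long value = pending.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (value == 0L) {
            pending.remove(rootId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker();
        long root = 1L;
        tracker.emitted(root, 10L);   // spout tuple
        tracker.emitted(root, 11L);   // first derived tuple
        tracker.emitted(root, 12L);   // second derived tuple
        System.out.println(tracker.acked(root, 10L)); // false: children outstanding
        System.out.println(tracker.acked(root, 11L)); // false: one child outstanding
        System.out.println(tracker.acked(root, 12L)); // true: tree fully processed
    }
}
```

The appeal of this scheme is that tracking a tuple tree of any size needs only a constant amount of memory per spout tuple. If the value has not reached zero within a timeout, the spout tuple can be treated as failed and replayed, which is what produces at-least-once behaviour.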
Features
Programming APIs
Storm offers multiple APIs catering to different use cases and expertise levels:[24]
Core API: The basic spout and bolt API providing maximum control and flexibility for experienced developers.
Trident API: High-level abstraction for stateful stream processing with exactly-once semantics. Provides operations like aggregations, joins, and windowing similar to batch processing APIs.
Stream API: Functional-style API introduced in Storm 2.0 for expressing computations using operations like map, filter, and aggregate, similar to Java 8 Streams.
SQL Integration: Support for SQL queries over streaming data through integration with Apache Calcite, allowing analysts to use familiar SQL syntax.[25]
Flux: YAML-based framework for defining and deploying topologies without writing Java code, improving accessibility for operations teams.
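Because the Stream API is described above by analogy with Java 8 Streams, the functional style it refers to can be shown with the JDK's own `java.util.stream` package. The snippet below deliberately uses no Storm classes and operates on a small bounded list, whereas a Storm stream is unbounded; the class `StreamStyle` and method `countLongWords` are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// JDK-only illustration of the map / filter / aggregate style.
// Uses java.util.stream on a bounded list; no Storm classes are involved.
class StreamStyle {

    // Counts occurrences of words longer than three characters.
    static Map<String, Long> countLongWords(List<String> sentences) {
        return sentences.stream()
            .flatMap(s -> Arrays.stream(s.split(" ")))   // one element per word
            .map(String::toLowerCase)                    // map: normalise case
            .filter(w -> w.length() > 3)                 // filter: drop short words
            .collect(Collectors.groupingBy(              // aggregate: count per word
                w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("Storm processes streams", "streams of tuples");
        System.out.println(countLongWords(sentences));
    }
}
```

Running `main` prints `{processes=1, storm=1, streams=2, tuples=1}`. In a streaming setting the final aggregation would typically be computed per window or kept as continuously updated state rather than returned once.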
State management
Storm provides several mechanisms for managing state in streaming applications:[26]
- Stateful bolts: Built-in support for maintaining state with automatic checkpointing and recovery
- External state stores: Integration patterns for databases, caches, and key-value stores
- Trident state: Sophisticated state management with automatic batching and exactly-once updates
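The checkpoint-and-recover cycle behind stateful processing can be modelled in a few lines. This is a toy sketch, not Storm's stateful bolt API: an in-memory copy stands in for a durable state store, and the class `CheckpointedCounter` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of checkpoint-and-restore for operator state; not Storm's API.
class CheckpointedCounter {
    private Map<String, Long> counts = new HashMap<>();
    // Stand-in for a durable store; a real system would persist this externally.
    private Map<String, Long> lastCheckpoint = new HashMap<>();

    // Update the live state for one incoming event.
    void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Save a snapshot of the current state.
    void checkpoint() {
        lastCheckpoint = new HashMap<>(counts);
    }

    // Simulate recovery after a failure: roll back to the last snapshot.
    void recover() {
        counts = new HashMap<>(lastCheckpoint);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CheckpointedCounter counter = new CheckpointedCounter();
        counter.process("a");
        counter.process("a");
        counter.checkpoint();
        counter.process("a");   // processed after the checkpoint
        counter.recover();      // failure: the third update is lost
        System.out.println(counter.get("a")); // prints 2
    }
}
```

The event processed after the checkpoint is lost on recovery, so it has to be replayed from the source to restore the correct count. This is why checkpointed state is usually paired with a replayable input and the message processing guarantees described earlier.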
Integration capabilities
Storm's ecosystem includes connectors for numerous external systems:[27]
- Message queues: Apache Kafka, RabbitMQ, Amazon Kinesis, Apache Pulsar
- Databases: Cassandra, MongoDB, Redis, HBase, JDBC-compatible databases
- Storage systems: HDFS, Amazon S3, Apache Solr, Elasticsearch
- Monitoring: Integration with metrics systems like Graphite, Ganglia, and custom metrics collectors
Performance and benchmarks
Independent performance evaluations have shown Storm's capabilities in various scenarios. A 2016 IEEE study comparing Storm, Flink, and Spark Streaming found that Storm excelled in low-latency scenarios while consuming fewer resources than alternatives.[8] However, the study also noted that Storm's throughput was lower than Flink for high-volume workloads.
Yahoo! Engineering reported processing 1.5 million events per second per node in their production Storm clusters, though they noted careful tuning was required to achieve these numbers.[28]
Use cases and adoption
Common applications
Storm is widely deployed for various real-time processing scenarios:[29]
- Real-time analytics: Computing metrics, KPIs, and dashboards on streaming data
- Continuous computation: Updating databases and caches with streaming computations
- Online machine learning: Training and scoring models in real-time
- Distributed RPC: Parallelizing intensive computations on demand
- ETL pipelines: Real-time data transformation, enrichment, and loading
- Complex event processing: Pattern detection, anomaly detection, and alerting
Industry adoption
Several companies have publicly documented their Storm deployments:
Twitter used Storm for real-time analytics on tweet streams, processing billions of events daily for trending topics and user analytics,[30] before developing Heron as its successor.
The Weather Channel processes weather data from thousands of sources using Storm to provide real-time weather updates and alerts.[31]
Alibaba reported using Storm for real-time computation in their recommendation systems and monitoring infrastructure.[32]
Limitations
While Storm pioneered many concepts in stream processing, it has several limitations compared to newer frameworks:[33]
- Event-time processing: Storm primarily supports processing-time semantics, while newer systems like Flink provide sophisticated event-time windowing
- State management complexity: Managing large state requires careful design compared to systems with built-in state backends
- Backpressure handling: Storm's backpressure mechanisms are less sophisticated than those in Flink or newer versions of Spark
- Resource isolation: Process-level isolation can lead to resource inefficiency compared to thread-level systems
- SQL support limitations: Storm's SQL capabilities are basic compared to dedicated streaming SQL engines
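The distinction between processing time and event time in the first limitation above can be made concrete with a small example. The sketch below assigns events to tumbling windows using either the time an event occurred or the time it arrived; it is a generic plain-Java illustration rather than code from Storm or Flink, and the class `WindowAssignment`, its nested `Event` type, and the timestamps are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Generic illustration of event-time vs processing-time window assignment.
// Not taken from Storm or Flink; all names and timestamps are made up.
class WindowAssignment {

    // An event with the time it occurred and the time it reached the processor.
    static final class Event {
        final String id;
        final long eventTimeMs;
        final long arrivalTimeMs;

        Event(String id, long eventTimeMs, long arrivalTimeMs) {
            this.id = id;
            this.eventTimeMs = eventTimeMs;
            this.arrivalTimeMs = arrivalTimeMs;
        }
    }

    // Start of the tumbling window of the given size that contains timestamp ts.
    static long windowStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    // Counts events per window, keyed by window start time.
    static Map<Long, Integer> countByWindow(List<Event> events, long sizeMs, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long ts = useEventTime ? e.eventTimeMs : e.arrivalTimeMs;
            counts.merge(windowStart(ts, sizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("a", 1000, 1200),
            new Event("b", 4000, 4100),
            new Event("c", 4900, 6500));  // occurred in the first window, arrived late
        System.out.println(countByWindow(events, 5000, true));   // {0=3}
        System.out.println(countByWindow(events, 5000, false));  // {0=2, 5000=1}
    }
}
```

The late event `c` is counted in the window where it occurred only when event time is used. Under processing time it falls into the following window, which is the kind of skew that event-time windowing is meant to avoid.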
Comparison with other systems
Storm was one of the first widely adopted distributed stream processing systems and influenced many successors:[34]
vs. Spark Streaming: Storm processes events individually while Spark uses micro-batches. This makes Storm better for sub-second latency requirements but Spark better for throughput-oriented workloads that can tolerate higher latency.
vs. Apache Flink: Flink provides more sophisticated state management, better event-time processing, and higher throughput, while Storm offers simpler operations, lower latency for simple topologies, and a more mature ecosystem.
vs. Kafka Streams: Kafka Streams is embedded in applications while Storm requires a cluster. Kafka Streams is simpler for Kafka-centric architectures, while Storm is more flexible for heterogeneous data sources.
vs. Twitter Heron: Heron was created by Twitter as Storm's successor, maintaining API compatibility while improving performance, debuggability, and operational characteristics.[35]
Community and ecosystem
Development community
Apache Storm maintains an active open-source community:[36]
- Regular releases following semantic versioning
- Active mailing lists with hundreds of subscribers
- GitHub-based development with pull request workflow
- Comprehensive documentation maintained by the community
- Annual StormCon conference (2013-2016, discontinued)
Commercial support
Several vendors provide commercial support and managed services for Storm:
- Cloudera (formerly Hortonworks) includes Storm in their data platform
- Amazon Web Services offers Storm on EMR
- Microsoft Azure provides Storm clusters through HDInsight
- Various consultancies specialize in Storm deployment and optimization
See also
- Stream processing
- Lambda architecture
- Apache Kafka
- Apache Flink
- Apache Spark
- Apache Samza
- Complex event processing
- List of Apache Software Foundation projects
References
- "Storm Initial Commit". GitHub. Retrieved May 29, 2025.
- "Apache Storm 2.8.0 Released". storm.apache.org. Retrieved May 29, 2025.
- "Apache Storm". storm.apache.org. Retrieved May 29, 2025.
- Marz, Nathan. "About Nathan Marz". Nathan Marz. Retrieved May 29, 2025.
- "A Storm is coming: more details and plans for release". Engineering Blog. Twitter Inc. September 19, 2011. Retrieved May 29, 2025.
- Chen, Li (March 15, 2024). "The State of Stream Processing in 2024". InfoQ. Retrieved May 29, 2025.
- "Storm Concepts". storm.apache.org. Retrieved May 29, 2025.
- Chintapalli, Sanket; Dagit, Derek; Evans, Bobby; Farivar, Reza; Graves, Thomas; Holderbaugh, Mark; Liu, Zhuo; Nusbaum, Kyle; Patil, Kishorkumar; Peng, Boyang Jerry; Poulosky, Paul (May 2016). "Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming". 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. pp. 1789–1792. doi:10.1109/IPDPSW.2016.138. ISBN 978-1-5090-3682-0. S2CID 2180634.
- "BackType Website (defunct)". BackType. Archived from the original on July 15, 2011. Retrieved May 29, 2025.
- Marz, Nathan; Warren, James (2015). Big Data: Principles and best practices of scalable real-time data systems. Manning Publications. pp. 156–198. ISBN 978-1617290343.
- Rao, Leena (July 5, 2011). "Twitter Acquires Social Analytics Company BackType". TechCrunch. Retrieved May 29, 2025.
- Harris, Derrick (March 28, 2012). "Twitter's Open Source Strategy". ZDNet. Retrieved May 29, 2025.
- "Storm Project Incubation Status". Apache Software Foundation. Archived from the original on January 1, 2014. Retrieved May 29, 2025.
- "Apache Storm Graduates to a Top-Level Project". Hortonworks. September 29, 2014. Archived from the original on October 1, 2014. Retrieved May 29, 2025.
- "Apache Storm 1.0.0 Released". storm.apache.org. April 12, 2016. Retrieved May 29, 2025.
- "Apache Storm 2.0.0 Released". storm.apache.org. May 30, 2019. Retrieved May 29, 2025.
- Gualtieri, Mike (June 18, 2019). "The Forrester Wave: Streaming Analytics Q2 2019". Forrester Research. Retrieved May 29, 2025.
- "Apache Storm 2.6.0 Released". storm.apache.org. November 22, 2023. Retrieved May 29, 2025.
- "Apache Storm 2.7.1 Released". storm.apache.org. November 28, 2024. Retrieved May 29, 2025.
- "Tutorial - Storm Topology". Documentation. Apache Storm. Retrieved May 29, 2025.
- "Storm Architecture". storm.apache.org. Retrieved May 29, 2025.
- Anderson, Quinton; Anderson, Sean T. (2013). Storm Real-Time Processing Cookbook. O'Reilly Media. pp. 23–45. ISBN 978-1782164425.
- "Guaranteeing Message Processing". storm.apache.org. Retrieved May 29, 2025.
- "Storm Documentation". storm.apache.org. Retrieved May 29, 2025.
- Taylor, John (August 22, 2016). "SQL on Apache Storm". DZone. Retrieved May 29, 2025.
- "State Checkpointing". storm.apache.org. Retrieved May 29, 2025.
- "Storm Integrations". storm.apache.org. Retrieved May 29, 2025.
- "Storm @ Yahoo! Scale". Yahoo Engineering. September 28, 2017. Retrieved May 29, 2025.
- Jain, Ankit (2014). Learning Storm. O'Reilly Media. pp. 89–112. ISBN 978-1449361808.
- "Real-time analytics at Twitter". Twitter Engineering. November 15, 2016. Retrieved May 29, 2025.
- Wolfe, Brian (October 23, 2014). "How The Weather Channel Uses Apache Storm". InfoQ. Retrieved May 29, 2025.
- Stream Processing at Alibaba with Apache Storm. Strata + Hadoop World. San Jose, CA. February 19, 2015.
- Narkhede, Neha (June 14, 2023). "Stream Processing Engines: A Comprehensive Comparison". Confluent Blog. Retrieved May 29, 2025.
- Kamburugamuve, Supun; Fox, Geoffrey (2017). "A Survey of Distributed Stream Processing Systems". ACM Computing Surveys. 49 (1): 1–35. doi:10.1145/2975568.
- Kulkarni, Sanjeev; Bhagat, Nikunj; Fu, Maosong (May 31, 2015). Twitter Heron: Stream Processing at Scale. SIGMOD '15. pp. 239–250. doi:10.1145/2723372.2742788.
- "Contributing to Storm". storm.apache.org. Retrieved May 29, 2025.
Further reading
- Marz, Nathan; Warren, James (2015). Big Data: Principles and best practices of scalable real-time data systems. Manning Publications. ISBN 978-1617290343.
- Allen, Matthew; Jankowski, Przemysław (2015). Storm Applied: Strategies for real-time event processing. Manning Publications. ISBN 978-1617291890.
- Jain, Ankit (2014). Learning Storm. O'Reilly Media. ISBN 978-1449361808.