[A version of this post appears on the O'Reilly Strata blog.]
One of the keys to Twitter’s ability to process 500 millions tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and alert teams when problems occur. The Observability stack collects 170 million individual metrics (time-series) every minute and serves up 200 million queries per day. Simple query tools are used to populate charts and dashboards (a typical user monitors about 47 charts).
The stack is about three years old1 and consists of instrumentation2 (data collection primarily via Finagle), storage (Apache Cassandra), a query language and execution engine3, visualization4, and basic analytics. Four distinct Cassandra clusters are used to serve different requirements (real-time, historical, aggregate, index). A lot of engineering work went into making these tools as simple to use as possible. The end result is that these different pieces provide a flexible and interactive framework for developers: insert a few lines of (instrumentation) code and start viewing charts within minutes5.
The Observability stack’s suite of analytic functions is a work in progress – only simple tools are currently available. Potential anomalies are highlighted visually and users can input simple alerts (“if the value exceeds 100 for 10 minutes, alert me”). While rule-based alerts are useful, they cannot proactively detect unexpected problems (or “unknown unknowns”). When faced with tracking a large number of time series, correlations are essential: if one time series signals an anomaly, it’s critical to know what others one should be worried about. In place of automatic correlation detection, for now Observability users leverage Zipkin (a distributed tracing system) to identify service dependencies. But its solid technical architecture should allow the Observability team to easily expand its analytic capabilities. Over the coming months, the team plans to add tools6 for pattern matching (search), and automatic correlation and anomaly detection.
While latency requirements tend to grab headlines (e.g. high frequency trading), Twitter’s Observability stack addresses a more common pain point: managing and mining many millions of time-series. In an earlier post I noted that many interesting systems developed for monitoring IT Operations are beginning to tackle this problem. As self-tracking apps continue to proliferate, massively scalable backend systems for time series need to be built. So while I appreciate Twitter’s decision to open source Summingbird, I think just as many users will want to get their hands on an open source version of their Observability stack. I certainly hope the company decides to open source it in the near future.
- Surfacing anomalies and patterns in Machine Data
- The re-emergence of time-series
- 11 Essential Features that Visual Analysis Tools Should Have
(0) Thanks to Franklin Hu and Yann Ramin of Twitter for walking me through the details of the Observability stack.
(1) The precursor to the Observability stack was a system that relied on tools like Ganglia and Nagios.
(2) “Just as easy as adding a print statement.”
(3) In-house tools written in Scala, the queries are written in a “declarative, functional inspired language”. In order to achieve near real-time latency, in-memory caching techniques are used.
(5) The system is best described as near real-time. Or more precisely, human real-time (since humans are still in the loop).
(6) Dynamic time warping at massive scale is on their radar. Since historical data is archived, simulation tools (for what-if scenario analysis) are possible but currently not planned. In an earlier post I highlighted one such tool from CloudPhysics.