Druid + Tranquility
Smart clients with Scala, Finagle, and Curator
These notes are my own, and only contain a rough approximation of the talk. All of the good ideas belong to the presenter and the errors are my own. Video will eventually appear @ http://www.meetup.com/SF-Scala/events/213630542/ according to the organizers
They wanted to be able to interact with arbitrary time series values at very high scale and very high efficiency. They wanted
- Arbitrary and interactive exploration
- Multi Tenancy with thousands of concurrent users
- Recency matters, they often needed to alert on major changes.
- Scalability & Availability really mattered for these products.
The first few attempts at a solution where using hadoop to pipe all of the data as an RDBMS, then a NoSQL K/V stores split amoung a number of dimensions, then they considered but didn't actually pipe them into commercial databases.
- Started in 2011, open sourced in 2012, ~40 contributors.
- Designed for low latency ingestion and aggregation
- ~1 queries on very large datasets.
Flat files, timestamps are truncated, grouped by over string columns, and aggregate numeric columns.
Data is stored in immutable chunks which also have thread affinity. They use a column orientation so they can only select the data they need.
Data Producers -> Streams Druid Realtime Workers -> Segment Handoff -> Historical Collector
Real Time Ingestion Streams are pushed to an incremental index, then a partial segment on desk, followed by a merged segment which is stored on disk
Tranquility requests some workers from an "overlord". The overlord
gives them some worker ids and the pushes partitioned, replicated data to those workers. It finds the overlord via curator, which they extended finalge to contact overloard semi directly.
Split batches are rolled up by timestamp and dimensions and distribute this data to all of the replicas in that partition. They're sent via Finagle async http post futures.
They use distributed locking with Curator and then use the metadata to coordinate tasks. [I didn't completely understand the exact purpose of this piece of the talk.]
- Druid was built to power interactive applicaitons
- Druid handles ingestion and querying of data segments.
- Tranquility handles partitioning, replication, and coordination.