SF Scala Notes (1/2)

Druid + Tranquility

Smart clients with Scala, Finagle, and Curator

Gian Merlino*

These notes are my own, and only contain a rough approximation of the talk. All of the good ideas belong to the presenter and the errors are my own. Video will eventually appear @ http://www.meetup.com/SF-Scala/events/213630542/ according to the organizers

Background

They wanted to be able to interact with arbitrary time series values at very high scale and very high efficiency. They wanted
to explore

  • Arbitrary and interactive exploration
  • Multi Tenancy with thousands of concurrent users
  • Recency matters, they often needed to alert on major changes.
  • Scalability & Availability really mattered for these products.

The first few attempts at a solution where using hadoop to pipe all of the data as an RDBMS, then a NoSQL K/V stores split amoung a number of dimensions, then they considered but didn't actually pipe them into commercial databases.

Druid
  • Started in 2011, open sourced in 2012, ~40 contributors.
  • Designed for low latency ingestion and aggregation
  • ~1 queries on very large datasets.
Raw Data

Flat files, timestamps are truncated, grouped by over string columns, and aggregate numeric columns.

Data is stored in immutable chunks which also have thread affinity. They use a column orientation so they can only select the data they need.

Data architecture

Data Producers -> Streams Druid Realtime Workers -> Segment Handoff -> Historical Collector

Real Time Ingestion Streams are pushed to an incremental index, then a partial segment on desk, followed by a merged segment which is stored on disk

Tranquility

Tranquility requests some workers from an "overlord". The overlord
gives them some worker ids and the pushes partitioned, replicated data to those workers. It finds the overlord via curator, which they extended finalge to contact overloard semi directly.

Split batches are rolled up by timestamp and dimensions and distribute this data to all of the replicas in that partition. They're sent via Finagle async http post futures.

Coordination:

They use distributed locking with Curator and then use the metadata to coordinate tasks. [I didn't completely understand the exact purpose of this piece of the talk.]

Take Aways:
  • Druid was built to power interactive applicaitons
  • Druid handles ingestion and querying of data segments.
  • Tranquility handles partitioning, replication, and coordination.