One Does Not Simply Query a Stream

A landscape guide to querying your Kafka data
Viktor Gamov

Who’s Here Tonight?
——————————————————————————————————————————
• How many of you are using Kafka in any capacity? (raise your hand)
• Keep your hands up if you’re doing Kafka Streams or Flink
• How many of you have tried to query data from Kafka? (like, SQL-style)
• How many gave up and just dumped it into Postgres? (no judgment… okay, a little judgment)
X/Bluesky: @gamussa Toronto Elastic Meetup 2 / 48

How This Talk Was Born
——————————————————————————————————————————
A couple of years ago, someone asked on Twitter…
▍ “When should I use Kafka Streams vs a real-time analytical database like Pinot?”
And my boss said… “That might be a very good idea for a talk.”
It turned out to be a very deep rabbit hole.
Today you get the compacted version. Kind of like a distilled version.

Before We Start
——————————————————————————————————————————
• Slides and recording at speaking.gamov.io
• I’m Viktor Gamov
• Developer Advocate at Confluent
• Co-author of Kafka in Action (Manning)
• Java Champion

Find me:
• X/Bluesky: @gamussa
• GitHub: gAmUssA
• gamov.dev/rel

The Problem
——————————————————————————————————————————

Simpler Times
——————————————————————————————————————————
Once upon a time, you had a monolith. One application. One database. One SQL query.

    SELECT * FROM orders WHERE status = 'pending';

Life was good. You could query anything.
Then someone said “microservices” and it all went sideways.

All Roads Lead to Kafka
——————————————————————————————————————————
Your data is already in Kafka topics.
• Events flow through topics in real time
• Data is immutable — once it’s written, it’s written
• Kafka is an append-only log, not a database
So how do you query it? You can’t just SELECT * FROM kafka_topic.
(Well… actually… we’ll get to that.)

The Log
——————————————————————————————————————————
Think of it like a ledger.
You can add entries. You can read entries. You cannot update entries.
Kind of like a conversation with your significant other.
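If the ledger idea feels abstract, here is a minimal sketch of it in plain Java. It is not a Kafka API: `AppendOnlyLog` and its methods are hypothetical names, and a single in-memory list stands in for one topic partition. The point is only that the log exposes append and read, and nothing else.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory stand-in for one topic partition; not a Kafka class.
final class AppendOnlyLog {
    private final List<String> entries = new ArrayList<>();

    // Appending returns the offset (position) assigned to the new entry.
    long append(String entry) {
        entries.add(entry);
        return entries.size() - 1;
    }

    // Reading by offset never changes the log.
    String read(long offset) {
        return entries.get((int) offset);
    }

    // Readers get an unmodifiable view: there is no update or delete operation.
    List<String> readFrom(long offset) {
        return Collections.unmodifiableList(
                entries.subList((int) offset, entries.size()));
    }

    // The offset the next appended entry will receive.
    long endOffset() {
        return entries.size();
    }
}
```

A "correction" in this model is just another entry: appending "order-1:shipped" after "order-1:pending" leaves both in the log, and it is up to the reader to decide which one counts.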

OLTP vs OLAP
——————————————————————————————————————————
This is the most important slide in this talk.

▒▒▒▒ OLTP
Transactional queries
• Get me order #12345
• What’s the current balance?
• Point lookups by key
• Low latency, single record

▒▒▒▒ OLAP
Analytical queries
• How many orders this hour?
• What’s the average basket size?
• Aggregations across millions
• Higher latency, many records

Every solution we’ll see today optimizes for one of these. Keep this in your head.
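To make the two query shapes concrete, here is a small sketch in plain Java over an in-memory list. The `Order` type and its fields are assumptions made up for illustration; the only thing to notice is that the first method touches one record by key while the second scans all of them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OltpVsOlap {
    // Hypothetical order record; the fields are assumptions for illustration.
    static final class Order {
        final String id;
        final double total;

        Order(String id, double total) {
            this.id = id;
            this.total = total;
        }
    }

    // OLTP-style: build an index by key once, then each lookup touches one record.
    static Map<String, Order> indexById(List<Order> orders) {
        Map<String, Order> byId = new HashMap<>();
        for (Order o : orders) {
            byId.put(o.id, o); // a later entry for the same id replaces the earlier one
        }
        return byId;
    }

    // OLAP-style: every query scans many records and reduces them to one number.
    static double averageBasket(List<Order> orders) {
        return orders.stream().mapToDouble(o -> o.total).average().orElse(0.0);
    }
}
```

"Get me order #12345" is `indexById(orders).get("12345")`. "What’s the average basket size?" is `averageBasket(orders)`, and its cost grows with the number of orders, which is why the two workloads end up in different systems.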

The Usual Suspects
——————————————————————————————————————————

Our Options Tonight
——————————————————————————————————————————
1. Kafka Connect + Relational Database
2. Kafka Streams (embedded querying)
3. Streaming SQL databases
4. Real-Time OLAP databases
5. Elasticsearch
6. Cloud Data Warehouses
7. Data Lakes + Table Formats
8. Tableflow (the new kid)
I will not give you a definitive answer. There are no right solutions. Only trade-offs.

Option 1: Kafka Connect + RDBMS
——————————————————————————————————————————

Connect + Database
——————————————————————————————————————————
The “just dump it into Postgres” approach.

When Connect + RDBMS Works
——————————————————————————————————————————
• You already know SQL
• Familiar tooling (pgAdmin, DBeaver, all this type of jazz)
• Great for smaller datasets
• OLTP-friendly — point lookups by key
• Not real-time — there’s a lag between Kafka and DB
• Doesn’t scale to millions of events/sec
• You’re maintaining another database
If this works for you? It’s fine. It’s totally fine. This is not a bad thing.

Option 2: Kafka Streams
——————————————————————————————————————————

Kafka Streams
——————————————————————————————————————————
What if you could query the stream from inside your application?

    StreamsBuilder builder = new StreamsBuilder();
    KTable<String, Long> counts = builder
        .<String, String>stream("page-views")
        .groupByKey()
        // Name the state store so Interactive Queries can find it
        .count(Materialized.as("counts-store"));

    // props (application.id, bootstrap.servers, default serdes) is assumed to exist
    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();

    // Interactive Queries - the OLTP trick
    ReadOnlyKeyValueStore<String, Long> store =
        streams.store(StoreQueryParameters.fromNameAndType(
            "counts-store", QueryableStoreTypes.keyValueStore()));
    Long count = store.get("home-page");

Kafka Streams: The Baby Bird Analogy
——————————————————————————————————————————
How does a consumer group work?
Imagine you have a nest with baby birds. Mommy bird brings worms (events). Each baby bird (consumer) gets a worm.
What happens if another baby bird hatches and there are 4 worms but 5 baby birds?
▍ “It dies.”
Not entirely. It will be hungry. It will starve. But rebalancing will happen.
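For those who prefer code to birds, here is a rough sketch of the nest in plain Java. It assumes the worms stand for the 4 partitions of a topic, and it uses a simplified round-robin rule: `NestAssignment.assign` is a hypothetical helper, not one of Kafka's real partition assignors, which are more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NestAssignment {
    // Simplified round-robin: each partition is owned by exactly one consumer.
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String consumer : consumers) {
            owned.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return owned; // nobody in the nest, nothing to hand out
        }
        for (int p = 0; p < partitions; p++) {
            String owner = consumers.get(p % consumers.size());
            owned.get(owner).add(p);
        }
        return owned;
    }
}
```

With 4 partitions and 5 consumers, the fifth consumer ends up with an empty list: it stays in the group but has nothing to read. If one of the other four leaves, calling `assign` again with the remaining members plays the role of the rebalance, and the hungry bird finally gets fed.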

When Kafka Streams Works
——————————————————————————————————————————
• No external database needed
• Embedded in your Java/Kotlin app
• Interactive Queries for OLTP lookups
• Exactly-once processing guarantees
• It’s a library, not a service — you manage the deployment
• Analytical queries (OLAP) are limited
• Scaling means scaling your app instances
Congratulations, you built your own database.

Option 3: Streaming SQL
——————————————————————————————————————————

What’s a Streaming Database?
——————————————————————————————————————————
• Takes a streaming source (Kafka) as first-class input
• Exposes a SQL interface for queries
• Maintains persistent state — materialized views updated continuously
The dream: write SQL, get real-time results.
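"Materialized views updated continuously" is the part that does the heavy lifting, so here is a tiny sketch of the idea in plain Java. It is only an illustration of incremental maintenance for something like `SELECT page, COUNT(*) ... GROUP BY page`; the class and method names are hypothetical, and a real streaming database also handles persistence, distribution, and failure recovery.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an incrementally maintained "count per page" view.
final class PageViewCounts {
    private final Map<String, Long> countsByPage = new HashMap<>();

    // Called once per incoming event: the view is adjusted in place,
    // never recomputed from the full history.
    void onEvent(String page) {
        countsByPage.merge(page, 1L, Long::sum);
    }

    // Queries read the precomputed state, so they stay cheap as the stream grows.
    long count(String page) {
        return countsByPage.getOrDefault(page, 0L);
    }
}
```

The work happens at write time, one event at a time, which is why reads against the view can come back quickly no matter how long the stream has been running.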

The Candidates
——————————————————————————————————————————
• ksqlDB (we’re not talking about this guy anymore)
• RisingWave — “Flink in Rust” (sue me)
• Timeplus — streaming analytics focus

R.I.P.
┌───────┐
│ksqlDB │
│ 2018- │
│ 2024  │
└───────┘

“But Viktor, Flink Has SQL!”
——————————————————————————————————————————
Yes. Yes it does. And Flink SQL is excellent for building streaming pipelines.
But Flink is a processing framework, not a database.
It doesn’t serve queries. It processes them and writes results somewhere else.
Different tool, different job.

RisingWave
——————————————————————————————————————————
• Distributed SQL streaming database
• Built in Rust (the cool kids rejoice)
• PostgreSQL-compatible interface
• Kafka as first-class streaming source
• Connectors for Iceberg, S3, DynamoDB
Think of it as… kind of like Flink reimagined as a database.

When Streaming SQL Works
——————————————————————————————————————————
• SQL interface — familiar for most developers
• Continuous materialized views — OLAP on streams
• No custom code needed
• Another service to deploy and manage
• Scaling characteristics vary wildly between products
• Community support is… developing

Option 4: Real-Time OLAP
——————————————————————————————————————————

Real-Time Analytical Databases
——————————————————————————————————————————
What if you need sub-second queries over billions of events?

▒▒▒▒ The Players
• Apache Pinot (I used to build this)
• Apache Druid
• StarRocks

▒▒▒▒ Characteristics
• High concurrency queries
• Ultra-low latency
• Native Kafka ingestion
• Purpose-built for OLAP

Real-Time OLAP: When It Shines
——————————————————————————————————————————
Think LinkedIn’s “Who Viewed Your Profile” — millions of users querying real-time analytics simultaneously.
• Millisecond query latency at massive scale
• Built for concurrent analytical queries
• Kafka is a first-class data source
• Specialized — you’re adding a dedicated OLAP cluster
• Complex operational overhead
• Schema management can be… interesting
Anyone using Apache Druid? StarRocks?

Option 5: Elasticsearch
——————————————————————————————————————————

Elasticsearch + Kafka
——————————————————————————————————————————
Kafka → Elasticsearch — the search and analytics pipeline.
Pipeline: Kafka Connect Elasticsearch Sink connector — battle-tested ingestion from streams.
Analytics: Elasticsearch as a hybrid search + analytics engine — full-text search, aggregations, and Kibana dashboards out of the box.

When Elasticsearch Works
——————————————————————————————————————————
• Full-text search — something none of the other options do well
• Kibana for dashboards and exploration out of the box
• Kafka Connect sink connector is battle-tested
• Hybrid: both point lookups (OLTP-ish) and aggregations (OLAP-ish)
• Not a streaming engine — it’s a destination, not a processor
• Schema mapping can get tricky with nested Avro/JSON
• Cluster sizing and shard management at scale
But you all know this already. That’s why you’re here tonight.

Option 6: Cloud Data Warehouses
——————————————————————————————————————————

Cloud DWH + Kafka
——————————————————————————————————————————
• Massive scale, managed service
• SQL interface everyone knows
• Kafka connectors available
• Batch-oriented — even “streaming” modes have latency
• Expensive at high volume
• Structured data bias — semi-structured gets messy
Good for analytics. Not great for real-time.

Option 7: Data Lakes + Table Formats
——————————————————————————————————————————

The Data Lake Story
——————————————————————————————————————————
“Just dump everything into S3. We’ll figure it out later.”
Later never came.
Then table formats appeared and saved us from ourselves:

Data Lake Formats
——————————————————————————————————————————
• Apache Iceberg — the one winning the adoption race
• Apache Hudi — Uber’s baby
• Delta Lake — Databricks’ entry

Data Lakes: The Good and the Ugly
——————————————————————————————————————————
• Decoupled storage and compute
• Open table formats (Iceberg!) — no vendor lock-in
• Query with anything — Spark, Trino, DuckDB, Flink
• Not real-time — compaction cycles mean minutes to hours of delay
• Complex infrastructure to maintain
• Getting streaming data IN is the hard part

Option 8: Tableflow
——————————————————————————————————————————

Tableflow: Kafka Topics as Iceberg Tables
——————————————————————————————————————————
What if your Kafka topic was already an Iceberg table?
No connectors. No ETL. The topic is the table.

When Tableflow Works
——————————————————————————————————————————
• Zero infrastructure to manage
• Open format (Iceberg) — query with any engine
• Automatic schema evolution from Schema Registry
• Data stays in your cloud storage
• Not real-time — compaction has latency (minutes)
• Confluent Cloud only (for now)
• Read-only — you query the table, not the stream
It’s a bridge between streaming and lakehouse. Pretty cool, actually.

So… Which One?
——————————————————————————————————————————

The Decision Framework
——————————————————————————————————————————
No ideal solutions. Only trade-offs.
┌──────────────────────────────────────────────┐
│                                              │
│  OLTP (point lookups)  →  Kafka Streams      │
│                           Connect + RDBMS    │
│                                              │
│  OLAP (analytics)      →  Real-Time OLAP     │
│                           Streaming SQL      │
│                                              │
│  Search + hybrid       →  Elasticsearch      │
│                                              │
│  Batch analytics       →  Data Lake          │
│                           Cloud DWH          │
│                           Tableflow          │
│                                              │
└──────────────────────────────────────────────┘
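If it helps to see the framework as a lookup rather than a picture, here is the same mapping as a small Java sketch. The enum and method names are hypothetical; the entries are exactly the ones from the slide, and they are a starting shortlist, not a verdict.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class DecisionFramework {
    // The four query patterns from the slide.
    enum QueryPattern { OLTP_POINT_LOOKUPS, OLAP_ANALYTICS, SEARCH_HYBRID, BATCH_ANALYTICS }

    private static final Map<QueryPattern, List<String>> OPTIONS =
            new EnumMap<>(QueryPattern.class);

    static {
        OPTIONS.put(QueryPattern.OLTP_POINT_LOOKUPS,
                Arrays.asList("Kafka Streams", "Connect + RDBMS"));
        OPTIONS.put(QueryPattern.OLAP_ANALYTICS,
                Arrays.asList("Real-Time OLAP", "Streaming SQL"));
        OPTIONS.put(QueryPattern.SEARCH_HYBRID,
                Arrays.asList("Elasticsearch"));
        OPTIONS.put(QueryPattern.BATCH_ANALYTICS,
                Arrays.asList("Data Lake", "Cloud DWH", "Tableflow"));
    }

    // Returns the candidates to evaluate first; the trade-offs still apply.
    static List<String> candidatesFor(QueryPattern pattern) {
        return Collections.unmodifiableList(OPTIONS.get(pattern));
    }
}
```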

Three Things to Consider
——————————————————————————————————————————
1. Familiarity — sometimes you go with what you know. And that’s not a bad thing.
2. Performance — if consumer lag keeps you up at night, look at Pinot or StarRocks
3. Community — when you’re choosing a solution, think about where you can go ask questions

The Community Thing
——————————————————————————————————————————
When I was learning to program, I picked Pascal. Not because it was the best language. Because there was a guy in my neighborhood who knew Pascal. When I got stuck, I could walk over and ask him.
That’s why Kafka won. Not because it’s perfect. Because people at meetups like this one could eat pizza and help each other figure it out.

What to Try This Week
——————————————————————————————————————————
1. Already using Kafka? Try Interactive Queries with Kafka Streams
2. Want SQL on streams? Spin up RisingWave or Materialize locally
3. Need analytics at scale? Look at Pinot + Kafka connector
4. On Confluent Cloud? Enable Tableflow on a topic and query with DuckDB
5. Come talk to me — I’ll be around after. Pizza first.

Resources
——————————————————————————————————————————
• Slides: speaking.gamov.io
• Streaming Frontiers: Live stream series (every 4 weeks)
• Confluent dev resources: developer.confluent.io
• Book: Kafka in Action (Manning)

As always, have a nice day.
——————————————————————————————————————————
Viktor Gamov — X/Bluesky: @gamussa

Questions?
——————————————————————————————————————————
I’ll be in the hallway track. Find me near the pizza.
