One Does Not Simply Query a Stream
A landscape guide to querying your Kafka data

Before We Start
I’m Viktor Gamov
Developer Advocate at Confluent
Co-author of Kafka in Action (Manning)
Java Champion
X / Bluesky: @gamussa
Slides + video: speaking.gamov.io
@gamussa | Chicago Elastic Meetup

Slides + Video
Scan for slides, speaker notes, and the recording: speaking.gamov.io


Simpler Times
Once upon a time, you had a monolith.
One application. One database. One SQL query.

    SELECT * FROM orders WHERE status = 'pending';

Life was good. You could query anything.
Then someone said “microservices”, and it all went sideways.

All Roads Lead to Kafka
Your data is already in Kafka topics.
Events flow through topics in real time.
Data is immutable: once written, it’s written.
Kafka is an append-only log, not a database.
So how do you query it? You can’t just SELECT * FROM kafka_topic.
(Well… actually… we’ll get to that.)


OLTP vs OLAP
This is the most important slide in this talk.

OLTP: transactional queries
  Get me order #12345
  What’s the current balance?
  Point lookups by key
  Low latency, single record

OLAP: analytical queries
  How many orders this hour?
  What’s the average basket size?
  Aggregations across millions
  Higher latency, many records

Every solution we’ll see optimizes for one of these. Keep this in your head.
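To make the two query shapes concrete, here is a minimal sketch in plain Python. The order records and field names (order_id, basket_size, hour) are made up for illustration; nothing here talks to Kafka.

```python
from statistics import mean

# Hypothetical order events, as they might look after being consumed from a topic.
orders = [
    {"order_id": 12345, "customer": "alice", "basket_size": 3, "hour": 9},
    {"order_id": 12346, "customer": "bob", "basket_size": 5, "hour": 9},
    {"order_id": 12347, "customer": "alice", "basket_size": 1, "hour": 10},
]

# OLTP shape: build an index by key once, then each lookup touches one record.
orders_by_id = {o["order_id"]: o for o in orders}

def get_order(order_id):
    """Point lookup by key; returns None if the order is unknown."""
    return orders_by_id.get(order_id)

# OLAP shape: each query scans many records and reduces them to a summary.
def orders_in_hour(hour):
    """How many orders arrived in the given hour?"""
    return sum(1 for o in orders if o["hour"] == hour)

def average_basket_size():
    """Average basket size across all orders."""
    return mean(o["basket_size"] for o in orders)
```

The lookup cost does not depend on how many orders exist, while both aggregations grow with the data. That is the difference the rest of the options optimize around.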

Our Options Tonight
1. Kafka Connect + Relational Database
2. Kafka Streams (embedded querying)
3. Streaming SQL databases
4. Real-Time OLAP databases
5. Elasticsearch
6. Cloud Data Warehouses
7. Data Lakes + Table Formats
8. Tableflow (the new kid)

I will not give you a definitive answer.
There are no right solutions. Only trade-offs.


When Connect + RDBMS Works
+ You already know SQL
+ Familiar tooling (pgAdmin, DBeaver, all that jazz)
+ Great for smaller datasets
+ OLTP-friendly: point lookups by key

- Not real-time: there’s a lag
- Doesn’t scale to millions of events/sec
- You’re maintaining another database

If this works for you? It’s fine. It’s totally fine.
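A rough sketch of the idea, using Python’s built-in sqlite3 as a stand-in for the relational database. The sink_to_db function is a hypothetical placeholder for what a sink connector does; it is not the Kafka Connect API, and the events list is invented.

```python
import sqlite3

# Hypothetical (order_id, status) events in topic order.
# A later event for the same key supersedes the earlier one.
events = [
    (12345, "pending"),
    (12346, "pending"),
    (12345, "shipped"),
]

def sink_to_db(conn, batch):
    """Write a batch of events into the orders table, keyed by order_id."""
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, status) VALUES (?, ?)",
        batch,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
sink_to_db(conn, events)

# Once the data has landed, it is plain SQL again.
pending = conn.execute(
    "SELECT order_id FROM orders WHERE status = 'pending'"
).fetchall()
```

The query only sees what the sink has already written, which is where the lag comes from: anything still sitting in the topic is invisible to SQL until the next batch lands.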


When Kafka Streams Works
+ No external database needed
+ Embedded in your Java/Kotlin app
+ Interactive Queries for OLTP lookups
+ Exactly-once processing

- It’s a library, not a service: you manage deployment
- Analytical queries (OLAP) are limited
- Scaling = scaling your app instances

Congratulations, you built your own database.
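Conceptually, embedded querying means the app folds events into local state and answers key lookups from that state. The sketch below shows only that idea in plain Python. BalanceStore is a hypothetical class, not the Kafka Streams API, and the real library also handles partitioning, fault tolerance, and routing lookups between instances.

```python
from collections import defaultdict

# Hypothetical (account, amount) events in topic order.
events = [("alice", 100), ("bob", 50), ("alice", -30)]

class BalanceStore:
    """In-process stand-in for a local state store (not the Kafka Streams API)."""

    def __init__(self):
        self._balances = defaultdict(int)

    def apply(self, key, amount):
        # Called once per event, in order, as the app consumes the topic.
        self._balances[key] += amount

    def get(self, key):
        # Point lookup by key; returns None for keys never seen.
        return self._balances.get(key)

store = BalanceStore()
for key, amount in events:
    store.apply(key, amount)
```

A call like store.get("alice") answers "what’s the current balance?" without leaving the process, which is the OLTP-style lookup. Asking for totals across all accounts would mean scanning the whole store, which hints at why OLAP is limited here.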


When Streaming SQL Works
+ SQL interface: familiar
+ Continuous materialized views: OLAP on streams
+ No custom code needed

- Another service to deploy and manage
- Scaling characteristics vary wildly between products
- Community support is… developing

Sidebar: Flink has SQL too, but Flink is a processing framework, not a database. Different tool, different job.
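What "continuous materialized view" means, roughly: the result of a GROUP BY is kept up to date as each event arrives, so a read never rescans the stream. Here is a minimal sketch of that idea in plain Python. OrdersPerHourView and its method names are hypothetical, and real streaming SQL databases do this from a SQL definition rather than hand-written code.

```python
from collections import defaultdict

class OrdersPerHourView:
    """Incrementally maintained stand-in for a view such as:
    SELECT hour, COUNT(*), AVG(basket_size) FROM orders GROUP BY hour
    """

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(int)

    def on_event(self, hour, basket_size):
        # Update only the affected group; no rescan of earlier events.
        self._count[hour] += 1
        self._total[hour] += basket_size

    def read(self):
        # Reading the view is cheap: the aggregates are already computed.
        return {
            hour: (self._count[hour], self._total[hour] / self._count[hour])
            for hour in sorted(self._count)
        }

view = OrdersPerHourView()
for hour, basket_size in [(9, 3), (9, 5), (10, 1)]:
    view.on_event(hour, basket_size)
```

The work happens on write, one event at a time, which is why the analytical read stays fast even as the stream keeps growing.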


When Real-Time OLAP Works
+ Millisecond query latency at massive scale
+ Built for concurrent analytical queries
+ Kafka is a first-class data source

- Specialized: dedicated OLAP cluster
- Complex operational overhead
- Schema management can be… interesting


When Elasticsearch Works
+ Full-text search: something none of the others do well
+ Kibana for dashboards out of the box
+ Kafka Connect sink is battle-tested
+ Hybrid: point lookups (OLTP-ish) + aggregations (OLAP-ish)

- Not a streaming engine: destination, not processor
- Schema mapping can get tricky with nested Avro/JSON
- Cluster sizing and shard management at scale

But you all know this already. That’s why you’re here tonight.

OPTION 6: CLOUD DATA WAREHOUSES (FIG. 12)
+ Massive scale, managed service
+ SQL interface everyone knows
+ Kafka connectors available

- Batch-oriented: even “streaming” modes have latency
- Expensive at high volume
- Structured data bias: semi-structured gets messy

Good for analytics. Not great for real-time.


The Decision Framework
OLTP (point lookups) -> Kafka Streams, Connect + RDBMS
OLAP (analytics)     -> Real-Time OLAP, Streaming SQL
Search + hybrid      -> Elasticsearch
Batch analytics      -> Data Lake, Cloud DWH, Tableflow

No ideal solutions. Only trade-offs.
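If it helps to keep the decision key next to your code, it fits in a small lookup table. This is only a restatement of the slide: the short key names ("oltp", "olap", and so on) and the candidates function are made up for illustration, and the result is a shortlist to evaluate, not a recommendation.

```python
# Mapping copied from the slide; the dictionary keys are hypothetical labels.
DECISION_KEY = {
    "oltp": ["Kafka Streams", "Connect + RDBMS"],
    "olap": ["Real-Time OLAP", "Streaming SQL"],
    "search": ["Elasticsearch"],
    "batch": ["Data Lake", "Cloud DWH", "Tableflow"],
}

def candidates(query_shape):
    """Return the shortlist for a query shape, or an empty list if unknown."""
    return DECISION_KEY.get(query_shape.lower(), [])
```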

Three Things to Consider
1. Familiarity: Sometimes you go with what you know. That’s NOT a bad thing.
2. Performance: If consumer lag keeps you up at night, look at Pinot or StarRocks.
3. Community: When you’re choosing, think about where you can go ask questions.

The Community Thing
I picked Pascal to learn. Not because it was the best language, but because there was a guy in the neighborhood who could help me.
That’s why Kafka won. Not because it’s perfect, but because people at meetups like this one eat pizza and help each other figure it out.

What to Try This Week
1. Already using Kafka? Try Interactive Queries with Kafka Streams.
2. Want SQL on streams? Spin up RisingWave or Materialize locally.
3. Need analytics at scale? Look at Pinot + Kafka connector.
4. On Confluent Cloud? Enable Tableflow on a topic and query with DuckDB.
5. Already using Elasticsearch? Try the Kafka Connect ES sink.
6. Come talk to me. I’ll be around after. Pizza first.

Resources
Slides + video: speaking.gamov.io
Book: Kafka in Action (Manning)
Confluent dev: developer.confluent.io
Streaming Frontiers: my live-stream series

AS ALWAYS, HAVE A NICE DAY