One Does Not Simply Query a Stream
A landscape guide to querying your Kafka data

Before We Start
I’m Viktor Gamov
Developer Advocate at Confluent
Co-author of Kafka in Action (Manning)
Java Champion
X / Bluesky: @gamussa
Slides + video: speaking.gamov.io
@gamussa | NYC Elastic Meetup 4 / 34

Slides + Video
Scan for slides, speaker notes, and the recording: speaking.gamov.io

Simpler Times
Once upon a time, you had a monolith.
One application. One database. One SQL query.
    SELECT * FROM orders WHERE status = 'pending';
Life was good. You could query anything.
Then someone said “microservices”, and it all went sideways.

All Roads Lead to Kafka
Your data is already in Kafka topics.
Events flow through topics in real time.
Data is immutable: once written, it’s written.
Kafka is an append-only log, not a database.
So how do you query it?
You can’t just SELECT * FROM kafka_topic. (Well… actually… we’ll get to that.)
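To make "append-only log, not a database" concrete, here is a minimal Python sketch. It is a toy in-memory model, not the Kafka client API: the `Log` class, its methods, and the sample orders are all made up for illustration. The point it tries to show is that a log only offers "append" and "read from an offset", so a WHERE-style question turns into a full replay.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Toy append-only log (hypothetical): records are only added at the end."""
    records: list = field(default_factory=list)

    def append(self, key, value):
        """Append a record and return its offset (its position in the log)."""
        self.records.append((key, value))
        return len(self.records) - 1

    def read_from(self, offset):
        """Sequential read from an offset: the only access path a log offers."""
        return self.records[offset:]


def find_by_status(log, status):
    """Answering a WHERE-style question means replaying the whole log."""
    latest = {}
    for key, value in log.read_from(0):
        latest[key] = value  # later records supersede earlier ones per key
    return sorted(k for k, v in latest.items() if v["status"] == status)


log = Log()
log.append("order-1", {"status": "pending"})
log.append("order-2", {"status": "pending"})
log.append("order-1", {"status": "shipped"})  # an update is a new record, not an overwrite

pending = find_by_status(log, "pending")
print(pending)
```

Every option in the rest of the talk is, in one way or another, a way to avoid doing that replay at query time.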

OLTP vs OLAP
This is the most important slide in this talk.
OLTP: transactional queries
  Get me order #12345
  What’s the current balance?
  Point lookups by key
  Low latency, single record
OLAP: analytical queries
  How many orders this hour?
  What’s the average basket size?
  Aggregations across millions
  Higher latency, many records
Every solution we’ll see optimizes for one of these. Keep this in your head.
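The two query shapes are easy to see side by side. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the `orders` table, its columns, and the three rows are invented for illustration and stand in for whatever system you end up querying.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL, hour INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (12345, "alice", 40.0, 9),
        (12346, "bob", 20.0, 9),
        (12347, "alice", 30.0, 10),
    ],
)

# OLTP: point lookup by key, touches a single record.
order = conn.execute(
    "SELECT customer, total FROM orders WHERE id = ?", (12345,)
).fetchone()

# OLAP: aggregation, has to look at every record to produce a summary.
per_hour = conn.execute(
    "SELECT hour, COUNT(*), AVG(total) FROM orders GROUP BY hour ORDER BY hour"
).fetchall()

print(order)
print(per_hour)
```

With three rows both are instant. The trade-offs only show up when the second query has to cover millions of events that are still arriving.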

Our Options Tonight
1. Kafka Connect + Relational Database
2. Kafka Streams (embedded querying)
3. Streaming SQL databases
4. Real-Time OLAP databases
5. Elasticsearch
6. Cloud Data Warehouses
7. Data Lakes + Table Formats
8. Tableflow (the new kid)
I will not give you a definitive answer. There are no right solutions. Only trade-offs.

When Connect + RDBMS Works
+ You already know SQL
+ Familiar tooling (pgAdmin, DBeaver, all that jazz)
+ Great for smaller datasets
+ OLTP-friendly: point lookups by key
- Not real-time: there’s a lag
- Doesn’t scale to millions of events/sec
- You’re maintaining another database
If this works for you? It’s fine. It’s totally fine.

When Kafka Streams Works
+ No external database needed
+ Embedded in your Java/Kotlin app
+ Interactive Queries for OLTP lookups
+ Exactly-once processing
- It’s a library, not a service: you manage deployment
- Analytical queries (OLAP) are limited
- Scaling = scaling your app instances
Congratulations, you built your own database.
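Here is a rough sketch of the idea behind querying state that lives inside your app. Kafka Streams is a Java library and this is not its API: the `Instance` and `App` classes are hypothetical, and CRC32 is only a stand-in for the real partitioner. It tries to show two things from the slide: each instance holds a local store you can do point lookups against, and a lookup has to be routed to whichever instance owns the key.

```python
import zlib


class Instance:
    """One app instance holding a local key-value state store (hypothetical)."""

    def __init__(self):
        self.store = {}

    def process(self, key, value):
        # Keep the latest value per key, like a table built from a changelog.
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)


class App:
    """Several instances; each key is owned by exactly one of them."""

    def __init__(self, num_instances):
        self.instances = [Instance() for _ in range(num_instances)]

    def owner(self, key):
        # Stand-in partitioner (CRC32); real Kafka partitioning differs.
        return self.instances[zlib.crc32(key.encode()) % len(self.instances)]

    def process(self, key, value):
        self.owner(key).process(key, value)

    def query(self, key):
        # A point lookup is routed to the instance that owns the key.
        return self.owner(key).get(key)


app = App(num_instances=3)
for key, value in [("order-1", "pending"), ("order-2", "pending"), ("order-1", "shipped")]:
    app.process(key, value)

print(app.query("order-1"))
print(app.query("order-404"))
```

Point lookups are cheap here. An aggregate across all keys would have to visit every instance, which is one way to read "analytical queries are limited".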

When Streaming SQL Works
+ SQL interface: familiar
+ Continuous materialized views: OLAP on streams
+ No custom code needed
- Another service to deploy and manage
- Scaling characteristics vary wildly between products
- Community support is… developing
Sidebar: Flink has SQL too, but Flink is a processing framework, not a database. Different tool, different job.
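A continuous materialized view is easier to picture with a toy version. The Python sketch below is an assumption-heavy illustration, not how any particular streaming SQL product works internally: the `OrdersPerHour` class and the event fields are invented. It shows the general idea of updating an aggregate as each event arrives, so that a query reads precomputed state instead of rescanning history.

```python
from collections import defaultdict


class OrdersPerHour:
    """Toy materialized view (hypothetical): order count and average total per hour."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_event(self, event):
        # Incremental update: constant work per event, no rescan of history.
        self.count[event["hour"]] += 1
        self.total[event["hour"]] += event["total"]

    def query(self, hour):
        # Reads hit the precomputed state, so they stay cheap.
        n = self.count.get(hour, 0)
        if n == 0:
            return None
        return {"orders": n, "avg_total": self.total[hour] / n}


view = OrdersPerHour()
for event in [
    {"hour": 9, "total": 40.0},
    {"hour": 9, "total": 20.0},
    {"hour": 10, "total": 30.0},
]:
    view.on_event(event)

print(view.query(9))
```

In a streaming SQL database you would declare the view in SQL and the engine would do this bookkeeping for you, which is the "no custom code needed" part.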

When Real-Time OLAP Works
+ Millisecond query latency at massive scale
+ Built for concurrent analytical queries
+ Kafka is a first-class data source
- Specialized: dedicated OLAP cluster
- Complex operational overhead
- Schema management can be… interesting

When Elasticsearch Works
+ Full-text search: something none of the others do well
+ Kibana for dashboards out of the box
+ Kafka Connect sink is battle-tested
+ Hybrid: point lookups (OLTP-ish) + aggregations (OLAP-ish)
- Not a streaming engine: destination, not processor
- Schema mapping can get tricky with nested Avro/JSON
- Cluster sizing and shard management at scale
But you all know this already. That’s why you’re here tonight.

OPTION 6: CLOUD DATA WAREHOUSES (FIG. 12)
+ Massive scale, managed service
+ SQL interface everyone knows
+ Kafka connectors available
- Batch-oriented: even “streaming” modes have latency
- Expensive at high volume
- Structured data bias: semi-structured gets messy
Good for analytics. Not great for real-time.

The Decision Framework
OLTP (point lookups) -> Kafka Streams, Connect + RDBMS
OLAP (analytics) -> Real-Time OLAP, Streaming SQL
Search + hybrid -> Elasticsearch
Batch analytics -> Data Lake, Cloud DWH, Tableflow
No ideal solutions. Only trade-offs.

Three Things to Consider
1. Familiarity: Sometimes you go with what you know. That’s NOT a bad thing.
2. Performance: If consumer lag keeps you up at night, look at Pinot or StarRocks.
3. Community: When you’re choosing, think about where you can go ask questions.

The Community Thing
I picked Pascal to learn, not because it was the best language, but because there was a guy in the neighborhood who could help me.
That’s why Kafka won. Not because it’s perfect, but because people at meetups like this one eat pizza and help each other figure it out.

What to Try This Week
1. Already using Kafka? Try Interactive Queries with Kafka Streams.
2. Want SQL on streams? Spin up RisingWave or Materialize locally.
3. Need analytics at scale? Look at Pinot + Kafka connector.
4. On Confluent Cloud? Enable Tableflow on a topic and query with DuckDB.
5. Already using Elasticsearch? Try the Kafka Connect ES sink.
6. Come talk to me. I’ll be around after. Pizza first.

Resources
Slides + video: speaking.gamov.io
Book: Kafka in Action (Manning)
Confluent dev: developer.confluent.io
Streaming Frontiers: my live-stream series

AS ALWAYS, HAVE A NICE DAY