How To Query A Stream? Viktor Gamov, StarTree @gamussa
Geecon, Kraków, 2024 @gamussa | @startreedata | @apachepinot
Slide 2
@gamussa | @startreedata | @apachepinot
Slide 3
@gamussa | @startreedata | @apachepinot
Slide 4
Viktor GAMOV Head of Developer Advocacy | StarTree
f
THE CLOUD CONNECTIVITY COMPANY
Twitter X: @gamussa
Kong Con idential
Slide 5
Simpler times Monolith
Slide 6
Simpler analytics ETL and CDC
Slide 7
DHW->Hadoop Mobile Era
Slide 8
Data Pipelines Streaming data pipelines and Microservices
Slide 9
LOG
Slide 10
OLTP stream vs OLAP vs. OLTP in Streams OLAP streams
Slide 11
Our Options
f
• • • • • •
Connect/Relational DB Ka ka Streams Streaming SQL Cloud Data Warehouse Data Lake Real-Time OLAP Database
Slide 12
f
Ka ka Connect
Slide 13
Connect/RDBMS Broker Broker Broker Cluster Data Source
Kafka Connect
Kafka Connect
Data Sink
Slide 14
`
Connect/RDBMS • Suitable for smaller data • Transactional • Familiar to users
Slide 15
f
Ka ka Streams
Slide 16
Ka ka Streams (transactional)
f
• Ingests directly from a topic • KTable • Forms an in-memory key/value store suitable for querying by topic key • Scalable across members of a consumer group • Readable through Interactive Queries
Slide 17
Ka ka Streams (transactional)
final KStream<String, String> stream = builder.stream(inputTopic, Consumed.with(stringSerde, stringSerde));
f
final KTable<String, String> convertedTable = stream.toTable(Materialized.as(“streamconverted-to-table”));
Slide 18
Ka ka Streams (analytical) • • • • •
Full-featured Java stream processing API Arbitrary streaming computation Can emit new streams (not this talk) KTables queryable by key
f
Every read pattern requires its own topology • Interactive Queries again
Slide 19
Ka ka Streams (analytical)
KTable<String, Long> wordCounts = textLines .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split(“\W+”))) .groupBy((key, word) -> word) .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as(“counts-store”));
f
wordCounts.toStream().to(“WordsWithCountsTopic”, Produced.with(Serdes.String(), Serdes.Long()));
Materialize
f
• Replacement data warehouse • Integrates with Ka ka, Postgres, dbt • The Materialized View is the central abstraction • Views are persistent and queryable • Postgres wire-compatible • Positioned as an analytics solution
Slide 25
Delta Stream
Cloud-native streaming SQL Serverless, BYOC Ka ka, Kinesis integration Materialized views and streaming pipelines • streaming database and streaming analytics
f
• • • •
Slide 26
Rising Wave
f
• Distributed SQL Streaming database • Cloud and OSS versions • Implementation of Flink in Rust • Ka ka, Pulsar, Kinesis integrations • Flink+persistent views • Postgres wire-compatible
Slide 27
ksqlDB
f
• «Streaming Database» • Provides persistent TABLE abstraction • Pull and Push queries • Like Ka kaStreams, but in SQL
Slide 28
Cloud Data Warehouses
Slide 29
Cloud Data Warehouses
Slide 30
Cloud Data Warehouses • The cloud-based heir of legacy DWH • Ingest from batch and streaming sources • Biased towards structured data and batch access
Slide 31
Data Lake
Slide 32
Data Lake
f
Anything else
We’ll igure this out
Slide 33
Data Lakes • • • • •
Started as the HDFS cluster Became S3 That didn’t help… ELT vs. ETL Iceberg/Hudi/DeltaLake
Slide 34
Data Lakes
f
• Storage and compute are radically decoupled • Structure is relatively less important • Reads are slow • Streaming is historically dif icult
Slide 35
Data Lakes
f
• Storage and compute are radically decoupled • Structure is relatively less important • Reads are slow • Streaming is historically dif icult
Slide 36
Real-Time Analytics Database
Slide 37
Real-Time OLAP
f
• Designed for high concurrency, low latency queries • Ingests from streaming and batch sources • Intimate integration with Ka ka • Conventional tables and SQL
Slide 38
Real-Time OLAP • Analytics shaped like realtime data • Analytics when users are decision makers
Slide 39
No Solutions Technology Selection only Trade Offs
Slide 40
Sometimes you go with what you know
Slide 41
This is not bad!
Slide 42
Performance
Performance
Slide 43
Community/Adoption
Community
Slide 44
Differentiated Application Code
Area of Exploration Kafka
Slide 45
Need your feedback
@gamussa | @startreedata | @apachepinot
Slide 46
Need your feedback It’s Anonymous
@gamussa | @startreedata | @apachepinot
Slide 47
For more resources on Apache Pinot: dev.startree.ai Viktor Gamov, StarTree @gamussa