How Do You Query a Stream?

A presentation at jPrime 2024 in May 2024 in Sofia, Bulgaria by Viktor Gamov

Slide 1

How To Query A Stream? Viktor Gamov, StarTree @gamussa Geecon, Kraków, 2024 @gamussa | @startreedata | @apachepinot

Slide 2


Slide 3


Slide 4

Viktor GAMOV
Head of Developer Advocacy | StarTree
THE CLOUD CONNECTIVITY COMPANY
Twitter/X: @gamussa
Kong Confidential

Slide 5

Simpler times: Monolith

Slide 6

Simpler analytics: ETL and CDC

Slide 7

DWH -> Hadoop: Mobile Era

Slide 8

Data Pipelines: Streaming data pipelines and Microservices

Slide 9

LOG

Slide 10

OLAP vs. OLTP in Streams
OLTP stream vs. OLAP streams

Slide 11

Our Options
• Connect/Relational DB
• Kafka Streams
• Streaming SQL
• Cloud Data Warehouse
• Data Lake
• Real-Time OLAP Database

Slide 12

Kafka Connect

Slide 13

Connect/RDBMS
[Diagram: Data Source, Kafka Connect, Kafka Cluster (Broker, Broker, Broker), Kafka Connect, Data Sink]

Slide 14

Connect/RDBMS
• Suitable for smaller data
• Transactional
• Familiar to users

Slide 15

Kafka Streams

Slide 16

Kafka Streams (transactional)
• Ingests directly from a topic
• KTable
• Forms an in-memory key/value store suitable for querying by topic key
• Scalable across members of a consumer group
• Readable through Interactive Queries

Slide 17

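Kafka Streams (transactional)
    // Read the input topic as a stream of key/value records.
    final KStream<String, String> stream =
        builder.stream(inputTopic, Consumed.with(stringSerde, stringSerde));
    // Materialize the stream as a table backed by a named state store.
    final KTable<String, String> convertedTable =
        stream.toTable(Materialized.as("stream-converted-to-table"));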
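The table side of this conversion can be pictured without a Kafka cluster. The following is a minimal plain-Java sketch, not Kafka Streams code: `ChangelogTable` is a hypothetical class, invented here for illustration, that folds a sequence of key/value records into a map holding the latest value per key. Treating a null value as a delete mirrors the tombstone convention that KTables use.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a materialized table; not part of Kafka Streams. */
public class ChangelogTable {

    // Latest value seen for each key, like a state store behind a KTable.
    private final Map<String, String> store = new HashMap<>();

    /** Apply one record: upsert on a value, delete on null (tombstone). */
    public void apply(String key, String value) {
        if (value == null) {
            store.remove(key);
        } else {
            store.put(key, value);
        }
    }

    /** Point lookup by key, the read pattern the slide describes. */
    public String get(String key) {
        return store.get(key);
    }

    public static void main(String[] args) {
        ChangelogTable table = new ChangelogTable();
        // Replay the stream in order; later records overwrite earlier ones.
        table.apply("alice", "Sofia");
        table.apply("bob", "Krakow");
        table.apply("alice", "Plovdiv");
        table.apply("bob", null);

        System.out.println(table.get("alice")); // Plovdiv
        System.out.println(table.get("bob"));   // null
    }
}
```

In Kafka Streams the store is partitioned across application instances, so a lookup may first have to find the instance that owns the key; this single-process sketch leaves that routing out.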

Slide 18

Kafka Streams (analytical)
• Full-featured Java stream processing API
• Arbitrary streaming computation
• Can emit new streams (not this talk)
• KTables queryable by key
• Every read pattern requires its own topology
• Interactive Queries again

Slide 19

Kafka Streams (analytical)
    // Split each line into lowercase words, group by word, and count.
    KTable<String, Long> wordCounts = textLines
        .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
        .groupBy((key, word) -> word)
        .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
    // Publish the running counts to an output topic.
    wordCounts.toStream().to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));

Slide 20

Streaming SQLs

Slide 21

Streaming SQL
• Materialize
• DeltaStream
• RisingWave
• ksqlDB

Slide 22

Why not Flink?
I love you, little squirrel.

Slide 23

Why not Flink?
But this talk is not about you.

Slide 24

Materialize
• Replacement data warehouse
• Integrates with Kafka, Postgres, dbt
• The Materialized View is the central abstraction
• Views are persistent and queryable
• Postgres wire-compatible
• Positioned as an analytics solution

Slide 25

DeltaStream
• Cloud-native streaming SQL
• Serverless, BYOC
• Kafka, Kinesis integration
• Materialized views and streaming pipelines
• Streaming database and streaming analytics

Slide 26

RisingWave
• Distributed SQL Streaming database
• Cloud and OSS versions
• Implementation of Flink in Rust
• Kafka, Pulsar, Kinesis integrations
• Flink + persistent views
• Postgres wire-compatible

Slide 27

ksqlDB
• "Streaming Database"
• Provides persistent TABLE abstraction
• Pull and Push queries
• Like Kafka Streams, but in SQL
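The difference between pull and push queries can be sketched in a few lines of plain Java. This is an illustration only and not the ksqlDB API: `PullPushTable` and its methods are hypothetical names. A pull query returns the current value for a key and finishes, while a push query stays subscribed and receives each later change.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical illustration of pull vs. push reads; not the ksqlDB API. */
public class PullPushTable {

    // Materialized state: a running count per key.
    private final Map<String, Long> rows = new HashMap<>();
    // Callbacks registered by push queries.
    private final List<BiConsumer<String, Long>> subscribers = new ArrayList<>();

    /** Pull query: return the current value for a key, then finish. */
    public Long pull(String key) {
        return rows.get(key);
    }

    /** Push query: register a callback that receives every later change. */
    public void push(BiConsumer<String, Long> onChange) {
        subscribers.add(onChange);
    }

    /** Ingest one event: bump the count for its key and notify subscribers. */
    public void ingest(String key) {
        long updated = rows.merge(key, 1L, Long::sum);
        for (BiConsumer<String, Long> subscriber : subscribers) {
            subscriber.accept(key, updated);
        }
    }

    public static void main(String[] args) {
        PullPushTable table = new PullPushTable();
        table.ingest("clicks");                      // arrives before anyone subscribes

        List<String> seen = new ArrayList<>();
        table.push((key, count) -> seen.add(key + "=" + count));
        table.ingest("clicks");
        table.ingest("views");

        System.out.println(table.pull("clicks"));    // 2
        System.out.println(seen);                    // [clicks=2, views=1]
    }
}
```

The subscriber never sees the first "clicks" event directly, only the updated count it contributed to, which is the sense in which both query styles read from the same continuously maintained table.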

Slide 28

Cloud Data Warehouses

Slide 29

Cloud Data Warehouses

Slide 30

Cloud Data Warehouses
• The cloud-based heir of legacy DWH
• Ingest from batch and streaming sources
• Biased towards structured data and batch access

Slide 31

Data Lake

Slide 32

Data Lake
Anything else
We’ll figure this out

Slide 33

Data Lakes
• Started as the HDFS cluster
• Became S3
• That didn’t help…
• ELT vs. ETL
• Iceberg/Hudi/DeltaLake

Slide 34

Data Lakes
• Storage and compute are radically decoupled
• Structure is relatively less important
• Reads are slow
• Streaming is historically difficult

Slide 35

Data Lakes
• Storage and compute are radically decoupled
• Structure is relatively less important
• Reads are slow
• Streaming is historically difficult

Slide 36

Real-Time Analytics Database

Slide 37

Real-Time OLAP
• Designed for high concurrency, low latency queries
• Ingests from streaming and batch sources
• Intimate integration with Kafka
• Conventional tables and SQL