Interesting Stuff - Week 5, 2021

Throughout the week, I read a lot of blog posts, articles, and so forth that have to do with things that interest me.

This blog post is the “roundup” of the things that have been most interesting to me for the week just ending.

Big Data

Intro to Apache Pinot. In last week's roundup, I posted a video link about doing real-time analytics using Apache Pinot and Kafka. The link here is to an awesome video introducing what Pinot is. If you are interested, it is a must-see!
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In some of the previous roundups, I have written about Data Meshes and how the Data Mesh is a hot topic today in the Big Data world. The video I have linked to here discusses another hot topic: the Lakehouse architecture. A Lakehouse is a data management system based on low-cost and directly accessible storage that also provides traditional analytical DBMS management and performance features.
A Short Introduction to Apache Iceberg. Part of the Lakehouse architecture is the table format. The table format is what provides ACID transactions, data versioning, and similar capabilities on top of the files in storage. Some table formats out there are Databricks Delta Lake, Apache Hudi, and Apache Iceberg. The post linked to here looks at Apache Iceberg and what we can do with it.
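
To make the table format idea a bit more concrete, here is a toy sketch in Python of the two capabilities mentioned above: atomic commits and data versioning. This is not Iceberg's API or its actual metadata layout. The names `ToyTable`, `Snapshot`, `commit_append`, and `files_as_of` are hypothetical and exist only for this illustration. The assumption is that a table is a list of immutable snapshots, each naming the data files visible in that version, and that a commit succeeds only if the table has not changed since the writer last read it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table (hypothetical, for illustration)."""
    snapshot_id: int
    files: tuple  # data file names visible in this version


@dataclass
class ToyTable:
    """A toy table: an append-only list of snapshots, starting empty."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base_id: int, new_files: list) -> Snapshot:
        # Optimistic check: reject the commit if another writer
        # has committed since this writer read snapshot `base_id`.
        if base_id != self.current().snapshot_id:
            raise RuntimeError(f"conflict: table changed since snapshot {base_id}")
        # The new snapshot becomes visible in one step; readers see
        # either the old version or the new one, never a partial write.
        snap = Snapshot(base_id + 1, self.current().files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id: int) -> tuple:
        # "Time travel": read the table as it looked at an older version.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(snapshot_id)


table = ToyTable()
table.commit_append(0, ["a.parquet"])
table.commit_append(1, ["b.parquet"])
print(table.current().files)   # latest version: both files
print(table.files_as_of(1))    # older version: only the first file
```

A writer that still holds snapshot 0 and tries `table.commit_append(0, [...])` at this point gets a conflict error and has to retry against the latest snapshot. Real table formats are far more involved, but the snapshot-plus-atomic-swap idea is the part worth keeping in mind when reading the linked post.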

Introducing seamless integration between Microsoft Azure and Confluent Cloud. Well, I guess the title says it all! We finally have a transparent integration between Azure and Confluent Cloud. Hopefully, we’ll now start to see posts from the Confluent guys (and girls) where they do “cool stuff” on Azure, and not only on AWS and Google Cloud.
Streaming Machine Learning with Apache Kafka and without another Data Lake by Kai Waehner. Usually, when we do Machine Learning, both training and inference, we use a data lake, perhaps even a Lakehouse as mentioned above. But it’s possible to avoid such a data store altogether by using an event streaming architecture. The linked video explains how to achieve this with Apache Kafka, Tiered Storage, and TensorFlow.

That’s all for this week. I hope you enjoy what I put together. If you have ideas for what to cover, please comment on this post or ping me.