Interesting Stuff - Week 17, 2023

Posted by nielsb on Sunday, April 30, 2023

This week: beating roulette, hooking up Azure Data Explorer and Kafka, real-time machine learning pipelines, a Spark loader for Hugging Face datasets, processing data from multiple streaming platforms using Delta Live Tables, and the real-time streaming ecosystem.

Misc.

  • THE GAMBLER WHO BEAT ROULETTE. OK, this is not online gambling, but the story is fascinating: how a guy figured out how to beat roulette without any computers or anything. Well worth a read.

Azure Data Explorer

  • Stream Data from Kafka to Azure Data Explorer. As you know, I am a big fan of Apache Kafka and Azure Data Explorer (ADX). I have written several blog posts about these two technologies, and each post has a section about setting up Kafka and ADX. I thought it would be beneficial to summarise those details in one place, and the linked repo, with its README.md file, is that summary.

AI/ML

  • Real Time ML Pipelines Using Quix with Tomáš Neubauer. This is an InfoQ podcast where Tomáš Neubauer talks about Quix Streams, an open-source Python library that simplifies real-time machine learning pipelines. He discusses various architectural designs, their pros and cons, and some real use cases. Very interesting!
  • Databricks ❤️ Hugging Face. This is a blog post from Databricks about how they have contributed a Spark loader for Hugging Face datasets. Hugging Face hosts a large collection of datasets, many of them for NLP. The datasets are available in various formats, including CSV, JSON, and Parquet. The blog post shows how to use the Spark loader to load a Spark DataFrame into a Hugging Face dataset.

Streaming

  • Processing data simultaneously from multiple streaming platforms using Delta Live Tables. This is a blog post from Databricks about using Delta Live Tables to process data from multiple streaming platforms. It shows how to use Delta Live Tables to ingest data from Kafka, Azure Event Hubs, and Azure IoT Hub, and how to process the streams from these platforms simultaneously.
  • Real-Time Streaming Ecosystem Part 1. This post is the first in a series about the real-time streaming ecosystem. In it, the author has compiled the open-source and vendor real-time solutions into an end-to-end real-time analytical use case. The post is very interesting, and I look forward to the next posts in the series.

~ Finally

That’s all for this week. I hope you enjoy what I have put together. Please comment on this post or ping me if you have ideas for what to cover.

