Interesting Stuff - Week 16, 2023

Posted by nielsb on Sunday, April 23, 2023

A mixed bag this week: Azure Databricks and its Delta Engine. In Azure Data Explorer “land”, there is a combined Kusto Sink Connector and Kafka Connect image built for multiple platforms. Databricks has also released MLflow 2.3 with some interesting new features.

Speaking of Databricks: new functions in Databricks SQL allow you to integrate large language models (LLMs) into SQL queries. Oh, and let’s not forget using Generative AI to create Synthetic Data.

Big Data

  • Azure Databricks - Delta Engine and it’s Optimizations. In this post, the author discusses the Delta Engine, a feature in Azure Databricks that enables faster queries and data transformations on Delta Lake tables. The author explains how the Delta Engine works and what it offers, such as dramatically better query performance compared to traditional Apache Spark queries. The article also covers some best practices for using the Delta Engine: partitioning tables, caching and data skipping, and using the z-order clustering algorithm to further optimize data access. All in all, it is a helpful overview for data engineers and data scientists who want faster queries and more efficient transformations on Delta Lake tables.
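
To give a feel for why z-order clustering helps data skipping, here is a minimal Python sketch of the general idea behind a Z-order (Morton) curve: interleave the bits of two column values so that rows close in both columns end up close in the sort order. This is only an illustration of the concept, not how the Delta Engine implements it, and the function names z_value and z_sorted are made up for this example.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    result = 0
    for i in range(bits):
        result |= ((x >> i) & 1) << (2 * i)
        result |= ((y >> i) & 1) << (2 * i + 1)
    return result


def z_sorted(points):
    """Sort (x, y) pairs by their Z-order value."""
    return sorted(points, key=lambda p: z_value(p[0], p[1]))


if __name__ == "__main__":
    # A 4x4 grid of (x, y) values, listed row by row.
    grid = [(x, y) for y in range(4) for x in range(4)]
    # After Z-ordering, each run of four points covers one 2x2 block,
    # so a file holding that run spans a narrow range in BOTH columns.
    print(z_sorted(grid))
```

Sorting by a single column keeps only that column's range narrow per file; the interleaving above keeps both ranges narrow, which is what lets min/max statistics skip files for filters on either column.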

Azure Data Explorer

  • Cross-Platform Compatibility Made Easy with Multi-Platform Docker Images: Azure Data Explorer Sink Connector & Kafka Connect. Wow, that title is a “mouthful”. The post is by “yours truly” and discusses how to create Docker multi-platform images. You may say: “Just build them with the correct --platform flag and push them to your Docker Hub image repo” - yup, that’s what I also thought. The problem I ran into was that I wanted the images to have the same name and tag(s) regardless of platform. Something like nielsb/myimage:latest, built for both AMD64 and ARM64. I could easily build those two images using the --platform flag, but when I pushed them to my Docker Hub repo, they overwrote each other. Anyway, the post talks about how to handle that. I put the link in the Azure Data Explorer section because my test case was an image containing the Azure Data Explorer Kusto Sink connector and Kafka Connect. That image is, by the way, publicly available here. I will maintain it and upload new versions whenever new Kafka Connect and/or Kusto Sink connector versions are released.
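
As a rough illustration of the problem above, one common way to get a single name and tag that serves several platforms is to let docker buildx build all platforms in one invocation and push the result, instead of pushing two separately built images. The Python sketch below only assembles that command line; it is an assumption on my part that this matches your setup, the linked post may handle it differently, and the helper name buildx_command is made up for this example.

```python
import shlex


def buildx_command(image, platforms, context="."):
    """Return the argument list for one multi-platform build-and-push.

    Passing all platforms to a single `docker buildx build ... --push`
    publishes one tag covering every platform, so the per-platform
    images do not overwrite each other.
    """
    return [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),
        "--tag", image,
        "--push",
        context,
    ]


if __name__ == "__main__":
    cmd = buildx_command(
        "nielsb/myimage:latest",           # image name from the example above
        ["linux/amd64", "linux/arm64"],    # the two target platforms
    )
    # Print the command rather than running it, so the sketch is safe to try.
    print(shlex.join(cmd))
```

To actually run the build, you could pass the returned list to subprocess.run, assuming Docker with buildx is installed and you are logged in to the registry.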

AI/ML

  • Introducing MLflow 2.3: Enhanced with Native LLM Support and New Features. MLflow is an open-source platform that enables organizations to manage their end-to-end machine learning workflows, from data preparation to deployment and monitoring. Databricks recently announced the release of version 2.3 of MLflow. The new version offers several enhancements that improve the user experience and make it easier to work with large language models (LLMs). The headline feature is native LLM support in the form of new model flavors for Hugging Face Transformers, OpenAI functions, and LangChain, allowing better integration with those tools. Overall, MLflow 2.3 is designed to help data scientists and machine learning engineers streamline their workflows and improve their productivity when working with complex machine learning projects.
  • Unlocking the Potential of Generative AI for Synthetic Data Generation. In last week’s roundup, I had a link to a post about Synthetic Data; the same goes for this week. This post examines how you can use Generative AI to create realistic synthetic data for software development, analytics, and machine learning. The post looks at using tools like ChatGPT and GitHub Copilot to create the data. The article provides an interesting perspective on the potential of Generative AI for synthetic data generation, and it highlights some key considerations and challenges that data scientists should keep in mind when using these tools.
  • Introducing AI Functions: Integrating Large Language Models with Databricks SQL. This article discusses the release of AI Functions, a new feature in Databricks SQL that enables users to integrate large language models (LLMs) into their SQL queries. With AI Functions, data analysts and data scientists can use LLMs such as GPT-3 to generate text, perform sentiment analysis, and carry out other natural language processing tasks directly within their SQL queries. The article highlights the potential of AI Functions for accelerating the adoption of machine learning and natural language processing in the enterprise. By making it easier to use LLMs within SQL queries, AI Functions can help democratize access to these powerful tools and enable more data-driven decision-making across organizations.

Streaming

  • Real-time Messaging. All of us have probably heard of Slack, the cloud-based messaging and collaboration platform. The company has over 12 million daily active users sending and receiving millions of real-time messages 24/7. This blog post describes Slack’s architecture for sending real-time messages at scale. The post looks at the services that deliver chat messages and various events to those online users in real time. Very interesting!

~ Finally

That’s all for this week. I hope you enjoy what I put together. Please comment on this post or ping me if you have ideas for what to cover.

