Interesting Stuff - Week 19, 2023

Posted by nielsb on Sunday, May 14, 2023

Some very cool stuff this week. Polars for Rust! Azure Data Explorer as a Vector DB. ChatGPT and Pandas = PandasAI. And the real-time streaming ecosystem.

LangChain, Streamlit, and Pinecone for LLM. Real-time sentiment analysis with Docker, Kafka, and Spark Streaming. And more!

Misc.

  • Rust Polars: Unlocking High-Performance Data Analysis — Part 1. I have mentioned Polars in previous roundups, but then it was always Polars for Python. This article is about Polars for Rust. It compares Polars with Pandas and shows some of the benefits of Polars, such as no index, Apache Arrow arrays, parallel operations, and lazy evaluation. It also demonstrates how to install and use Polars for basic data manipulation. The article concludes that Polars is a promising alternative to Pandas for data analysis.

Azure Data Explorer

  • Azure Data Explorer for Vector Similarity Search. The article shows how to use Azure Data Explorer (ADX) for vector similarity search, a technique for finding similar vectors in a dataset. It explains the concepts of vector database, vector similarity, vector embeddings, and vector similarity function (VSF). It also demonstrates how to use ADX to search various Wikipedia pages. I found this very interesting, and I will do some tests with this myself.
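
To make the idea of vector similarity a bit more concrete, here is a minimal sketch in plain Python. It is not the article's code and it does not use ADX; it only illustrates cosine similarity, one common vector similarity function. The tiny three-dimensional "embeddings", the page names, and the helper names `cosine_similarity` and `top_k` are all made up for illustration. Real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Rank documents by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, invented for this example.
documents = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.9, 0.1, 0.0],
    "page_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, documents, k=2)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

A vector database does essentially this at scale: it stores one embedding per document and returns the stored vectors closest to the query vector.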

AI/ML

  • Deploying a Langchain Large Language Model (LLM) with Streamlit & Pinecone. As the title implies, this post demonstrates how to deploy a large language model (LLM) application using Streamlit, Pinecone, and Langchain. It explains how to use Langchain to create a chain that connects an LLM, a vector database, and a vector similarity function (VSF). It also shows how to use Streamlit to create a web interface for the application and how to use Pinecone as the vector database. Very cool!
  • PandasAI — Pandas Newborn child from ChatGPT. The article shows how to use PandasAI, a new library that enables data manipulation through conversational AI. It uses ChatGPT to understand natural language queries and executes them using Pandas. The article also demonstrates how to use PandasAI for data exploration and machine learning on a sample dataset.

Streaming

  • Real-Time Streaming Ecosystem - Part 2. This post is the second in a multi-part series exploring the real-time streaming ecosystem and its various components. The post covers connectors, change data capture (CDC), ELT, and rETL solutions and providers. Excellent article. I am looking forward to the next in the series!
  • Real-Time Sentiment Analysis with Docker, Kafka, and Spark Streaming. This is a continuation of a previous article that compared different classification algorithms and feature extraction functions for tweet sentiment analysis with PySpark. This post is about performing real-time sentiment analysis on tweets using Docker, Kafka, and Spark Streaming. It explains how to set up a data pipeline that extracts tweets from the Twitter API, sends them to a Kafka producer, consumes them with a Spark Streaming application, and loads the results to MongoDB or Delta Lake. The article also provides a pre-trained PySpark model for tweet classification and shows how to use it in the Spark Streaming application. Very interesting!

~ Finally

That’s all for this week. I hope you enjoy what I put together. Please comment on this post or ping me if you have ideas for what to cover.

