Interesting Stuff - Week 33

Throughout the week, I read a lot of blog-posts, articles, and so forth, that have to do with things that interest me:

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other “stuff”

This blog-post is the “roundup” of the things that have been most interesting to me, for the week just ending.

Distributed Computing

Canopy: Scalable Distributed Tracing & Analysis @ Facebook. This post links to an InfoQ presentation about Canopy, which is Facebook’s performance and efficiency tracing infrastructure. The presentation covers lessons learned applying Canopy and presents case studies of its use in solving various performance and efficiency challenges. Very interesting!
Designing Distributed Systems. A download link to an e-book: Designing Distributed Systems. The e-book provides repeatable, generic patterns and reusable components to make developing reliable systems easier and more efficient. It is written by Brendan Burns, who is a Distinguished Engineer at Microsoft and works on Azure.

Cloud

Schema-Agnostic Indexing with Azure Cosmos DB. In this blog post, Murat dissects a white paper about the schema-agnostic indexing subsystem of Cosmos DB. The post (and the paper) are very interesting. Go ahead and read them, please!
Azure #HDInsight Interactive Query: simplifying big data analytics architecture. This post discusses a new feature of Hive 2, Low Latency Analytical Processing (LLAP). LLAP produces significantly faster queries on raw data stored in commodity storage systems such as Azure Blob storage or Azure Data Lake Store. This is quite exciting, and I need to check it out!

Streaming

How to Build a UDF and/or UDAF in KSQL 5.0. Not one week without at least one Kafka-related post - that is the “law”. This post discusses a new feature in KSQL 5: the ability for users to write their own functions for KSQL to use. Think about the possibilities that open up!

Data Science / AI

Neural Networks from a Bayesian Perspective. This post covers different ways to obtain uncertainty in Deep Neural Networks from a Bayesian perspective. The post is quite theoretical but very interesting!
100x Faster Bridge between Apache Spark and R with User-Defined Functions on Databricks. Spark exposes an API, the SparkR User Defined Function API, which acts as a bridge between Spark and R. Unfortunately, the bridge is far from efficient. Databricks has made the bridge more efficient when you run Spark on Databricks, and this post talks about how it is done.
The most important part of a data science project is writing a blog post. The title of this blog post is somewhat provocative, but it makes a good point. Always document your data science projects so other data scientists can see what you have achieved!

SQL Saturday

It is that time of the year again: SQL Saturday season! As usual, I am presenting in Johannesburg, Cape Town and Durban:

Johannesburg, September 1:
- Overview SQL Server Machine Learning Services.
Cape Town, September 8:
- Azure Machine Learning.
- The Ins and Outs of sp_execute_external_script.
Durban, September 15:
- The Ins and Outs of sp_execute_external_script.

Even if you are not interested in the topics I present, please register and come and listen to a lot of interesting talks by some of the industry’s brightest people.

PreCon

This year I am also doing precons in Cape Town and Durban on the Friday before each SQL Saturday event. My precon is a day where we talk about SQL Server Machine Learning Services: what it is and what we can do with it. The format is hands-on, so if you want, you can bring your laptop and code along as the day progresses.

The precon is not free, but hey …

Even though the titles of the precons are different, I cover the same material.

~ Finally

That’s all for this week. I hope you enjoy what I put together. If you have ideas for what to cover, please comment on this post or ping me.