Interesting Stuff - Week 3

Throughout the week, I read a lot of blog posts, articles, etc., that have to do with things that interest me:

data science
data in general
distributed computing
SQL Server
transactions (both db as well as non db)
and other “stuff”

This is the “roundup” of the posts that have been most interesting to me this week.

This week there will be quite a few links to white papers from this year's Conference on Innovative Data Systems Research (CIDR). It was started in 2002 by very illustrious people from the database industry: Michael Stonebraker, Jim Gray, and David DeWitt! The conference gives the database community a venue to present groundbreaking and innovative data systems architectures. This year it was held January 8 - 11, and you can find all the presentations here.

I have had a quick glance through the white papers, and the following are the ones that I am interested in and have had a chance to look at in some detail:

Data Ingestion for the Connected World. A discussion of a new architecture for doing ETL in a world where real-time data is of the utmost importance. The solution, which I am really, really interested in getting to know more about, centers around:
- Apache Kafka, a message broker type system
- S-Store, a streaming OLTP engine that seeks to seamlessly combine online transactional processing with push-based stream processing for real-time applications
- Intel’s BigDAWG, a distributed polystore engine
Evolving Databases for New-Gen Big Data Applications. Presents a system for handling high-volume transactions while concurrently executing complex analytics queries in a large-scale distributed big data platform.
SnappyData: A Unified Cluster for Streaming, Transactions, and Interactive Analytics. Yet another system for OLTP workloads and analysis in real-time.
The Data Civilizer System. As a data scientist you probably spend most of your time finding, preparing and cleaning data, instead of doing “real” work! This paper presents Data Civilizer, a system that helps data scientists to:
- discover data sets from a large number of tables
- link relevant data sets
- compute answers from the data stores that hold the discovered data
- clean the desired data
- iterate through the tasks using a workflow system

As mentioned before, the above papers were the ones of interest that I had a chance to at least skim through. There is a wealth of other papers at the site, so go and have a look. I also want to give a shout-out to the morning paper, which, over the last week, has started dissecting these papers. So if you don’t have time to go through all the papers yourself, browse to the morning paper, and get the papers served to you!

So what else have I found interesting this week:

Data Science

Microsoft R Server tips from the Tiger Team. Blog post from Revolution Analytics with quite a few links to tips about Microsoft R Server. Very useful “stuff”!
Announcing Data Science Utilities Version 0.11, for the Team Data Science Process. Microsoft has released a new version of tools and utilities for its Team Data Science Process. This is something I will take a very close look at!
Microsoft R Server in the News. Another blog post from Revolution Analytics, this time with links to articles in the tech press about the capabilities of Microsoft R Server in production environments. Some cool stuff in there!

Distributed Computing

Apache Kafka: Getting Started. Apache Kafka is one of the more popular message brokers out there (though it is much more than a message broker), and Kafka appears in most solutions for distributed applications. Just see Data Ingestion for the Connected World above! This post is a very good introduction to getting started with Kafka.
Reactive Kafka. Kafka again. This time from InfoQ: a presentation about how the new reactive streams interface for Kafka can be used to build robust applications in the microservices world.

SQL Server

Automate Delivery of SQL Server Production Data Environments Using Containers. Exactly what the title says: how containers can be used in the SQL Server world. This is of great interest to us here at Derivco, seeing how many SQL Server instances we have out in the world (we have one of the biggest SQL Server installations worldwide).

That’s all for this week. I hope you enjoyed what I put together. If you have ideas for what to cover, please comment on this post or ping me.
