Throughout the week, I read a lot of blog-posts, articles, etc., that have to do with things that interest me:

  • data science
  • data in general
  • distributed computing
  • SQL Server
  • transactions (both db as well as non db)
  • and other “stuff”

This is the “roundup” of the posts that have been most interesting to me this week.

This week there will be quite a few links to white-papers from this year's Conference on Innovative Data Systems Research (CIDR). It was started in 2002 by very illustrious people from the database industry: Michael Stonebraker, Jim Gray, and David DeWitt! The conference gives the database community a venue to present groundbreaking and innovative data systems architectures. This year it was held January 8 - 11, and you can find all the presentations here.

I have had a quick glance through the white-papers, and the following are the ones that I am interested in and have had a chance to look at in some detail:

  • Data Ingestion for the Connected World. A discussion of a new architecture for doing ETL in a world where real-time data is of utmost importance. The solution, which I am really, really interested in getting to know more about, centers around:
    • Apache Kafka, a message broker type system
    • S-Store, a streaming OLTP engine that seeks to seamlessly combine online transactional processing with push-based stream processing for real-time applications.
    • Intel’s BigDAWG, a distributed polystore engine
  • Evolving Databases for New-Gen Big Data Applications. Presents a system for handling high-volume transactions while concurrently executing complex analytics queries in a large-scale distributed big data platform.
  • SnappyData: A Unified Cluster for Streaming, Transactions, and Interactive Analytics. Yet another system for OLTP workloads and analysis in real-time.
  • The Data Civilizer System. As a data scientist you probably spend most of your time finding, preparing and cleaning data, instead of doing “real” work! This paper presents Data Civilizer, a system to help data scientists:
    • discover data sets from a large number of tables
    • link relevant data sets
    • compute answers from the data stores that hold the discovered data
    • clean the desired data
    • iterate through the tasks using a workflow system

As mentioned before, the above papers were the ones of interest that I had a chance to at least skim through. There is a wealth of other papers at the site, so go and have a look. I also want to do a shout-out to the morning paper, which, over the last week, has started dissecting these papers. So if you don’t have time to go through all the papers yourself, browse to the morning paper, and get the papers served to you!

So what else have I found interesting this week:

Data Science

Distributed Computing

  • Apache Kafka: Getting Started. Apache Kafka is one of the more popular message brokers out there (though it is much more than a message broker), and Kafka appears in most solutions for distributed applications. Just see Data Ingestion for the Connected World above! This post is a very good introduction to getting started with Kafka.
  • Reactive Kafka. Kafka again. This time it is a presentation from InfoQ about how the new reactive streams interface for Kafka can be used to build robust applications in the microservices world.

SQL Server

That’s all for this week. I hope you enjoy what I put together. If you have ideas for what to cover, please comment on this post or ping me.
