The light at the end of the tunnel is Apache Spark
If all your data is in one database, your systems are probably slowing down as that database grows. If your data is spread over different database platforms, servers and locations, you are probably struggling to get a clear picture of what is going on in your organisation.
You are not alone!
These two scenarios, or variations of them, are becoming more common. More business activities are digitised, so data accumulates faster than ever before, pushing operational costs up. Apart from more processing power and more storage, your data also requires more maintenance. Extracting data becomes cumbersome. Backups become costly and time-consuming.
In this post I want to talk about Apache Spark and how it can be used to solve the problem of a single database growing beyond its expected size, the first of the two scenarios described above.
The problem
Our website has been running for years. It uses a MySQL database and a simple, but effective, front end built on the Symfony PHP framework. The system was designed to be horizontally scalable: it uses GUIDs as identifiers so that the database can be spread across servers, and it logs a lot of user activity to the database for statistics.
The database has become a behemoth, and most of the data is now logs generated by people and bots browsing the site. Moving the site is an increasingly time-consuming exercise. Large tables in MySQL are slowing the site down. Backups take hours, and restoring the database would take just as long, so disaster recovery will be… a disaster.
Identifying possible solutions
We can split the database and create a separate database just for logging data. This would require some rework of the site's logging code.
We can remove some of the log data. Simple and effective, but we lose information. We still want to be able to run new reports on old data.
The solution – Apache Spark can help!
We can port the data over to the Databricks platform, making it available for all kinds of reporting and analysis. Databricks even has an API that allows us to pull data from their platform back to ours.
With a few lines of code and a free account on the Databricks Community Edition, we did exactly that:
I started by setting up a read-only user on the client's system and opening a port on their firewall so that the code could access the database via a normal JDBC connection URL on the standard MySQL port (3306).
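Here is a minimal sketch of that read in PySpark. The host, schema, table and credential names are placeholders, not taken from the real system:

from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession named `spark` already exists;
# this line only matters if you run the sketch elsewhere.
spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:mysql://db.example.com:3306/site_db"  # placeholder host and schema
password = "********"                                  # placeholder; keep real credentials in a secret store

logs_df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "activity_log")    # hypothetical log table
    .option("user", "spark_readonly")     # the read-only user
    .option("password", password)
    .option("driver", "com.mysql.jdbc.Driver")
    .load())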
Due to the limitations of the free Databricks Community Edition, and to stop long-running reads from timing out, I read the data 1 000 000 records at a time until the end of the table was reached.
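The loop below sketches that chunked read, reusing jdbc_url, password and spark from the snippet above. The paging query and the archive table name are illustrative, not from the client's system:

CHUNK = 1000000
offset = 0
while True:
    # Page through the table with LIMIT/OFFSET pushed down to MySQL;
    # in practice, ORDER BY a stable key to keep the pages consistent.
    page_query = "(SELECT * FROM activity_log LIMIT {} OFFSET {}) AS page".format(CHUNK, offset)
    chunk_df = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", page_query)
        .option("user", "spark_readonly")
        .option("password", password)
        .load())

    if chunk_df.rdd.isEmpty():   # no rows left: end of table
        break

    # Append each chunk to a table inside the Databricks workspace
    chunk_df.write.mode("append").saveAsTable("activity_log_archive")
    offset += CHUNK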
Once the data has been copied across, we can truncate the log tables on the live system. If we need to do some analytics, we can use the Databricks platform to generate charts, for example:
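As a hypothetical example (the column names are assumptions, not the real schema), here is a daily page-view rollup that a Databricks notebook's display() renders as a table, bar chart or line graph:

from pyspark.sql import functions as F

daily = (spark.table("activity_log_archive")
    .groupBy(F.to_date("created_at").alias("day"))   # assumed timestamp column
    .agg(F.count("*").alias("page_views"))
    .orderBy("day"))

display(daily)   # switch the output between table and chart views in the notebook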
Why Apache Spark?
This amazing platform makes it possible to combine data from any number of different sources using simple code hosted in the cloud.
All we want to do with the data is analyse it and generate some reports, or roll it up into smaller data sets for display purposes. Apache Spark gives us table views, bar charts, line graphs and all kinds of other magnificent tools to do this.
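For instance, a rollup like the one below (again with assumed column names) shrinks millions of raw log rows into a small summary table that reports can query cheaply:

from pyspark.sql import functions as F

summary = (spark.table("activity_log_archive")
    .groupBy(F.to_date("created_at").alias("day"), "is_bot")   # assumed columns
    .agg(F.count("*").alias("hits"),
         F.countDistinct("session_id").alias("sessions")))

# Persist the rollup so reports read a few thousand rows, not millions
summary.write.mode("overwrite").saveAsTable("daily_traffic_summary")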
It is robust, and thanks to Databricks it is possible to scale up our processing on Amazon Web Services, so the possibilities are practically endless.
Here is a link to my notebook on the Databricks platform. You can view it, clone it, and see how I took a very common headache out of a client's life.