Internet Age Computing, Data and Spark
Google is the trailblazer in terms of internet age style computing. If your mission is: “Organize the world’s information and make it universally accessible and useful.” – one can expect to run into a few challenges. Google was prepared to commit serious money to solving these challenges, and to to do so they recruited the smartest people they could find. Some of what has occurred over the past 15 years:
Storage and computation
Storage
Instead of achieving reliable scale with ever larger proprietary Storage Area Networks (SANs) from expensive vendors such as Hewlett Packard, use commodity machines and disks with clever software. In 2003 Google published a paper on how they did this, calling it the Google File System.
The open source world created the Hadoop File System (HDFS) based on Google’s work.
Since then the world can store Petabytes of data cost effectively.
Distributed computation
Want to now programmatically access and process your Petabytes of data? Instead of trying to use one big machine from vendors such as Sun or Hewlett Packard to do it, use 1000’s of commodity machines. In 2004 Google again published a paper: “MapReduce: Simplified Data Processing on Large Clusters”.
The open source world created the Hadoop version of MapReduce based on Google’s work. Ask big questions of big data.
Since then the world can cost effectively run distributed programs on Petabytes of distributed data.
Databases – from one species to a Cambrian explosion
Distributed structured data processing
Want to add some more structure to your data and ask specific, small questions from your petabytes of data, with less programming effort, get a fast response and not pay relational database vendors such as Oracle a fortune? Google did, and in 2006 Google again published a paper: “Bigtable: A Distributed Storage System for Structured Data”.
The open source world created HBase, the Hadoop stack version of BigTable based on Google’s work.
Since then the world can create, query and update structured data at petabyte scale with distributed, commodity computing and storage infrastructure.
In order to achieve this scale, Google had to revisit the fundamentals of data, computing and persistent state. It had to revisit the concept of databases. The BigTable paper cites papers going back to 1960, which is a long time ago in the world of computing. BigTable is a kind of database, just not a relational database.
The way to think about the broad options we have when it comes to database design is clarified by the CAP theorem – summarised as:
“You can pick any two of
Consistency
Availability
Partition tolerance
but not all three.
Traditional relational databases such as Oracle, MySQL etc. optimise for the first two – Consistency and Availability. The price you pay for that is that you cannot scale horizontally across 1000’s of commodity machines.
BigTable optimises for Consistency and Partition tolerance, with a slightly higher risk of the BigTable cluster not being Available. In addition Google relaxed schema compliance, enabling one to easily extend the type of data that a table stores, with rows in a table not all having to comply to the same strict structure, or “schema”.
By this time, other new internet giants were applying their minds to these computing challenges. In 2007, Amazon published the paper “Dynamo: amazon’s highly available key-value store”.
DynamoDB is a key-store, with each row being a key followed by one unstructured data field. DynamoDB optimises for Availability and Partition tolerance, with a slightly higher risk of data not being consistent across nodes in the cluster (choosing eventual consistency).
Facebook got in on the act and built Cassandra, which it open sourced in 2008. Cassandra is essentially a hybrid between BigTable and Dynamo: the structured, flexible columns per row approach (not only key-value) from BigTable and the high Availability approach of Dynamo: a slight compromise to data Consistency across the cluster.
The database genie was out of the box.
Google and other’s success in being able to update and query petabytes of data led to broad innovation in the field of databases, a process that is still ongoing. Most of these innovations are in the open source space. The catch-all term for these non-relational databases is NoSQL databases. They all provide varying optimisations across Consistency, Availability and Partition tolerance.
Previously, the database world was dominated by vendors such as Oracle, IBM, Microsoft etc. These vendors controlled the proprietary software related to how data was stored, accessed and updated. This space has been “disintermediated” by open source software, enabling choice at each level of a database system:
- storage options: HDFS, S3, Google Cloud storage etc.
- file formats for storing data – Parquet, ORC, Kudu etc.
- “SQL” engines to access and update data – HBase, Hive, Impala, Spark, Cassandra etc.
Choosing a database is no longer just about choosing the relational database vendor. It is about choosing the correct optimisation across aspects such as consistency, availability, partition tolerance, read speed, write speed, schema flexibility etc. It also means that migrating from one database technology to another may be more complex, since not all database adhere to the ANSI SQL standard.
Spark – your Swiss Army Knife for Data
A relatively recent entrant in the world of data and data processing is Apache Spark. Originally developed at the University of California, Berkeley’s AMPLab in 2009, the Spark codebase was donated to the Apache Software Foundation in 2013.
The amazing thing about Spark is how effortlessly it combines a multitude of programming paradigms and data processing tools.
Want to load data?
Fine. Spark can read a multitude of file formats: CSV, Parquet, ORC, Avro or any database source.
Want to process your data?
Great. What language do you prefer? Java, Scala, Python, R? All are welcome. SQL? Of course, if it involves data then feel free to use your SQL skills. You can intertwine your preferred language with SQL as you wish. Awesome.
Batch mode or streaming? No problem, Spark does both.
Want to process A LOT of data?
Sure thing. Spark has a distributed architecture. You can use 1000’s of cost effective nodes in unison to crunch through your data.
Want to analyse your data?
Naturally. Descriptive statistical tools are part and parcel of Spark. Standard deviations, histograms, correlations, hypotheses testing, principal component analysis, clustering etc. etc.
Want to use your data?
Now we’re talking. Recommendation systems, image recognition, natural language processing, linear regression, classification, fraud detection, anomaly detection etc. If you can think it, you can do it in Spark.
Pretty cool. Get your data, process your data, make your data work for you – all on one platform.
To get started with Spark, check out our 2 day training offering: Data Engineering and Data Science with Apache Spark.
We look forward to the future of information processing platforms. The faster we can automate the world the faster we can all enjoy margaritas at the beach…