In the technology world, things move so fast, and problems are sometimes solved faster than they appear, that very few buzzwords last more than a decade. Big data is one of them. Recently, here in Silicon Valley, I often hear investors and founders talking about avoiding this type of buzzword when pitching. I was curious, but I can sense a certain amount of fear in the air toward this untamed beast.
The modern concept of big data has been evolving for more than two decades. What people considered Big Data just a few years ago may be completely different from what the term means today. To see through the fog into the future, it's best to first understand the history.
Above is a very brief timeline of the key technology concepts and milestone events that I am aware of. I think it's fair to divide the timeline into two major eras:
In 1998, when Google was founded, there were only roughly 26 million web pages on the internet; by 2000 this number had grown to 1 billion. By 2008, there were an estimated 1 trillion unique pages on the web. Indexing all of this content, and storing and querying it efficiently, were the major technology challenges.
Social & Mobile Era
The amount of data collected exploded. There are no official statistics, but simply take a look at all the new super-scale data centers constructed worldwide. The ability to store, query, and analyze the data, and to derive insights from it very efficiently, has become the major technology challenge of this era. This battle is still ongoing.
Let’s also dive a bit into some technology concepts that played key roles.
- GFS – The Google File System paper was published in 2003. This technology allowed extremely large amounts of data to be stored reliably on clusters of commodity hardware.
- MapReduce – Google published the MapReduce paper in 2004. This technology allowed extremely large amounts of data to be processed in a parallel, distributed environment. It is the keystone of batch processing systems.
- NoSQL movement – NoSQL refers to a set of non-relational data storage technologies. Most NoSQL data stores are distributed systems and are therefore subject to the trade-offs described by the CAP theorem. The results of large-scale batch processing are typically stored in this type of data store.
- Stream processing – Stream processing is a newer technology that currently faces many technical challenges. It plays a very different role from batch processing and may become the foundation of future hyper-scale data processing models.
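To make the difference between batch and stream processing more concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the programming model, not how Google's MapReduce or any real streaming engine is implemented. All function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`, `streaming_word_count`) are made up for this example, and word counting is simply the usual textbook workload.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    In a real cluster this is where data moves between machines;
    here it is just an in-memory dictionary (an assumption of this sketch).
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Batch style: the whole input is available before any answer is produced."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


def streaming_word_count(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Stream style: update running counts and emit a snapshot after every event."""
    counts: Dict[str, int] = defaultdict(int)
    for event in events:
        for word, one in map_phase(event):
            counts[word] += one
        yield dict(counts)


if __name__ == "__main__":
    docs = ["big data is big", "data moves fast"]
    print(word_count(docs))
    for snapshot in streaming_word_count(docs):
        print(snapshot)
```

The batch version returns one answer after reading everything, while the streaming version produces an updated answer after each incoming event and never assumes the input has ended. That difference in when results become available is why the two approaches play such different roles.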