The Big Data Paradox – a thought experiment for the data-mindful

Here is an interesting thought experiment for you to try.

Let’s make a few assumptions first:

  • All knowledge/facts in this world can be digitized and stored over connected computer hardware across datacenters worldwide.
  • Understanding a single data point is usually less valuable than a combination of related data sets, and their derived insights.
  • Generating insights from related data takes time and computing power, both roughly proportional to the size of the data and to the time needed to physically move it.
  • If we draw a graph that contains all the datasets as vertices and their relationships as edges, we’ll find that most, if not all, of them are connected in some way (see the small sketch after this list).
  • Data/knowledge changes over time.
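
To make the graph assumption concrete, here is a minimal sketch in Python. The dataset names and edges are made up purely for illustration; it simply models datasets as vertices, relationships as edges, and checks whether everything is reachable from everything else:

```python
from collections import defaultdict, deque

# Hypothetical datasets (vertices) and relationships (edges).
edges = [
    ("user_profiles", "purchase_history"),
    ("purchase_history", "product_catalog"),
    ("product_catalog", "supplier_records"),
    ("user_profiles", "web_clickstream"),
]

# Build an undirected adjacency list.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def connected(graph):
    """Return True if every dataset is reachable from every other one."""
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)

print(connected(graph))  # True for this toy example
```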

With the above assumptions in place, let’s say there is an all-powerful being who, by leveraging this digital data, wants to see through every detail of the world, to understand all past facts, and to predict the future. Is this realistic? Is this where we are heading with Big Data engineering? Is it achievable with today’s technology?

[Figure: data graph]

If you are familiar with Laplace’s demon, you’ll recognize this as a similar thought experiment, posed over two centuries ago:

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.


In Laplace’s version, the thought experiment relies heavily on assumptions about the laws of physics, both those that were well understood at the time (e.g. Newton’s laws) and those that were not (e.g. quantum mechanics). Without a clear understanding of all the physical fundamentals, it is very hard to prove such determinism.

Similarly, in our thought experiment within the data realm, we might not have all the necessary fundamental knowledge just yet. For now, though, we can tackle it from one perspective we are already familiar with.

Today, most technology companies rely heavily on Hadoop clusters for batch-processing very large amounts of data. The typical workflow brings all relevant data into a single processing cluster/datacenter to optimize for data locality and to avoid the repeated round-trip cost of retrieving remote data. A MapReduce job then extracts useful information in a parallel, distributed fashion; MapReduce relies on the data being locally accessible. As a side note, MapReduce has its own limitations, but I’ll leave that for a future blog post.
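
To illustrate the map/shuffle/reduce pattern described above, here is a minimal single-process Python sketch of the same idea (not Hadoop itself, and the input records are made up) that counts words over locally accessible data:

```python
from collections import defaultdict

# Toy input: records that are already locally accessible to the cluster.
records = [
    "big data needs locality",
    "data locality saves roundtrips",
    "insights come from data",
]

# Map phase: each record is turned into (key, value) pairs independently,
# which is what allows the work to be spread across many machines.
mapped = []
for record in records:
    for word in record.split():
        mapped.append((word, 1))

# Shuffle phase: group all values by key (in Hadoop this step moves data
# between mappers and reducers over the network).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values into a single insight.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # e.g. {'data': 3, 'locality': 2, ...}
```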

With the above in mind, let’s make a few observations:

  1. With today’s technology, data is moved to and processed within a centralized physical location.
  2. The processing volume of even today’s biggest batch-processing clusters is a tiny fraction of “all data in the world”.
  3. “Data processing” takes input in the form of data and produces output, insights, that are also data. A successful Hadoop MapReduce run therefore increases the total amount of data.
  4. The generated insight data is related to other existing data.

With these observations, let’s push the idea to the extreme: can we build a datacenter big enough to centrally process all the data in the world? After a bit of thought, it’s easy to see a paradox that makes our thought experiment infeasible with this approach.

Suppose there were a datacenter big enough to centrally replicate all data from every other datacenter in the world, with enough processing power to generate knowledge data from it. The generated knowledge data would inevitably relate to the existing data, and so would itself become input for further processing. Given this recursive pattern, and given that some data changes over time, such a datacenter cannot stably exist: it would eventually require unbounded storage and compute power.
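
To see the blow-up numerically, here is a toy model in Python. The numbers are assumptions for illustration only: each processing round derives new insight data as some fixed fraction of the data already stored, and because the new data relates back to the old data, it becomes input for the next round.

```python
# Toy model of the recursion: every round of processing derives new
# insight data proportional to what is already stored, and that new
# data immediately becomes input for the next round.
total_data_pb = 1.0            # starting size, in petabytes (arbitrary)
insight_ratio = 0.10           # assumed: each round adds 10% derived data
datacenter_capacity_pb = 10.0  # assumed fixed capacity

round_number = 0
while total_data_pb <= datacenter_capacity_pb:
    total_data_pb *= (1 + insight_ratio)
    round_number += 1

print(f"Capacity exceeded after {round_number} rounds "
      f"({total_data_pb:.1f} PB stored)")
```

For any finite capacity and any positive growth rate, the loop terminates: a fixed-size datacenter falls behind after finitely many rounds, and since the underlying data also keeps changing, the reprocessing never ends.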

So the big question now: is all of this really unrealistic, even in the near future? I’d say not necessarily; we may be able to tackle it from an entirely different approach. Let’s continue the discussion next time!
