Avoiding the Big Data Madness

As is often the case with a technology trend or meme, the term “big data” has been misapplied and misunderstood by many, particularly those without practical knowledge of information technology.

For all its current coolness, big data is far from new. Manufacturing plants and other industrial processes have been collecting massive amounts of data for decades.

To provide some perspective, a manufacturing operation with 100 machines, each with 60 interesting pieces of data to collect, sampled about once per second, will generate a staggering 180 billion or so data values per year, consuming probably three terabytes of permanent storage. And that’s a relatively tiny manufacturing plant.
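For anyone who wants to check the math, here is a rough back-of-the-envelope sketch. It assumes the plant runs nonstop all year and that each stored value costs about 16 bytes (say, a timestamp plus a reading); both are assumptions made for illustration, not figures from any particular historian or database. Under them, the count lands a bit under 190 billion values, in line with the rounded number above.

```python
# Back-of-the-envelope estimate of yearly data volume for a small plant.
# Assumption: the plant runs around the clock, every day of the year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def yearly_values(machines: int, tags_per_machine: int,
                  samples_per_second: float = 1.0) -> int:
    """Number of data values collected in one year."""
    return int(machines * tags_per_machine * samples_per_second * SECONDS_PER_YEAR)


def storage_terabytes(values: int, bytes_per_value: int = 16) -> float:
    """Storage in decimal terabytes; 16 bytes per value is an assumed size."""
    return values * bytes_per_value / 1e12


values = yearly_values(machines=100, tags_per_machine=60)
print(f"{values:,} values per year")
print(f"{storage_terabytes(values):.1f} TB of storage")
```

Change the machine count or sample rate and the totals scale linearly, which is why a mid-sized plant sampling faster than once per second gets into serious territory quickly.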

There are well over 50,000 manufacturing and industrial sites in the world, so the numbers are significant. Yet Twitter, as much as it is held up as a poster child for “cloud scale data”, processes less than 180 billion tweets per year. Now imagine 50 billion connected devices streaming out data and events.

Welcome to *real* big data.

What is interesting, however, is that vendors each put their unique spin on big data to maximize their “sell what ya got” business proposition.

If you’re a sensor/hardware provider, it’s all about putting sensors on everything (which wastes money and energy; sometimes adding comprehensive sensing to just a subset of the device population is more than adequate).

If you’re a storage vendor, it’s all about collecting and saving data forever in case you might need it someday (unlikely). If you’re a cloud IaaS (or networking) vendor, it’s all about sending those gazillions of values over the network to the cloud to do things “at scale” (which is absurdly expensive). If you’re an analytics vendor, it’s all about gleaning insights from those gazillion readings (only a tiny fraction of that raw data actually has any useful value).

The reality of the situation is that there is no “one size fits all” answer.

What is certain, however, is that the “sensor to cloud” solution espoused by many IoT and general technology platform providers is bogus. The Internet of Things requires a distributed data solution: distributed collection, local aggregation, rules, and filtering, and often even local analytics. Let’s not forget that “things” aren’t always connected to the Internet, either. And why on earth would you want to beam 8 TB of a single engine’s performance data over a long-range wireless link?

Data generally has a “half-life” that can vary from seconds to years, with the vast majority of raw data having a half-life of less than 24 hours. Trends, aggregates, and outliers may produce far more insight than the raw readings themselves, and in many cases it is best to detect them as close to the device as possible.
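To make that concrete, here is a minimal sketch of what local aggregation might look like on a gateway or edge device. The function name `summarize_window`, the one-minute window, and the simple z-score outlier rule are all assumptions chosen for illustration; a real deployment would pick detection rules to suit the equipment.

```python
from statistics import mean, pstdev


def summarize_window(readings, z_threshold=3.0):
    """Reduce one window of raw readings to a small summary plus any outliers.

    A reading counts as an outlier when it sits more than `z_threshold`
    standard deviations from the window mean (an assumed, simplistic rule).
    """
    avg = mean(readings)
    spread = pstdev(readings)
    outliers = [
        r for r in readings
        if spread > 0 and abs(r - avg) / spread > z_threshold
    ]
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": avg,
        "outliers": outliers,
    }


# One minute of once-per-second temperature readings with a single spike.
window = [20.0] * 59 + [95.0]
summary = summarize_window(window)
print(summary)
```

Instead of sending 60 raw values upstream, the device sends one small summary and the single reading that actually looks interesting.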

Also, powerful technologies such as machine learning, which do benefit from large, very granular data sets, don’t need to be applied to the entire population of machines. Applying sound sampling and grouping approaches can reduce the cost and effort by many orders of magnitude.
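As a rough illustration of the sampling and grouping idea, the sketch below picks a small fraction of machines from each group (here, each machine model) to receive full, granular data collection. The `sample_fleet` helper, the 5 percent fraction, and the made-up fleet are assumptions for the example, not a recommended sampling plan.

```python
import math
import random
from collections import defaultdict


def sample_fleet(machines, group_key, fraction=0.05, seed=42):
    """Pick a random fraction of machines from every group (at least one each)."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    groups = defaultdict(list)
    for machine in machines:
        groups[group_key(machine)].append(machine)

    chosen = []
    for members in groups.values():
        k = max(1, math.ceil(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return chosen


# Hypothetical fleet: 1,000 machines spread evenly across four models.
fleet = [{"id": i, "model": f"model-{i % 4}"} for i in range(1000)]
instrumented = sample_fleet(fleet, group_key=lambda m: m["model"])
print(f"Collecting granular data from {len(instrumented)} of {len(fleet)} machines")
```

Grouping before sampling keeps every machine type represented, so a model trained on the sample has seen each kind of equipment without instrumenting the whole population.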

We’ll be talking more about this in the weeks and months ahead, but suffice it to say that “big data” in the IoT is *really* big, yet we only need to transmit, process, store, and analyze a relatively small percentage of it to see incredible ROI.

Rick Bullotta will join a panel discussion on “The Value of Big Data and Analytics from the Internet of Things” at LiveWorx 2015. Don’t miss the opportunity to gain more insight on this topic from industry thought leaders.