Inside a conference room off the lobby of one of the many buildings on the Microsoft campus in suburban Seattle, Dave Brown holds court on (relatively) big data. Brown is a senior research software development engineer for Microsoft Research, which is a long and official way to say he has been working with data and systems for years, and is here to show off some of that work.
Standing next to a screen large enough to mount on most living room walls, he sifts through what he describes as 40,000 lines of data about American tornadoes. He has already plugged them in, searched for patterns and, more important, worked to figure out what those patterns represent. The result is a 3-D spatiotemporal interactive data visualization — in this instance, of the contiguous United States, littered with data points, more in the Midwest than anywhere else.
“This,” he says, pointing toward Kansas, “is a small set.”
Small because big data is a relative term. Some companies consider a couple hundred gigabytes in the system to be big data, while others hardly blink at thousands of terabytes. The collection and storage of data isn’t a problem — “The amount of computation and storage you can buy per dollar keeps getting greater,” says Doug Cutting, chief architect of Cloudera and the founder of numerous open source projects that have supported big data, including Lucene, Hadoop and Avro — nor is sifting through it. But actually analyzing that data and finding meaning in it is another matter.
Start with the DIKW Pyramid, a business and data staple for decades. DIKW is an acronym for data, information, knowledge and wisdom. Data forms the base of the pyramid, and given context, it becomes information. In turn, information given context becomes knowledge, and knowledge given context becomes wisdom, the top of the pyramid. Most of us are far better at collecting data than we are at turning it into knowledge. That, in brief, is the biggest problem with big data.
“We can leverage more data than ever before,” Cutting says, “and we have to, because we can use it to see at a very high resolution what we’re doing — and we can improve it.”
For instance, that tornado data, without context, is just 40,000 lines of dates, numbers, locations, stories and lives captured in a digital nugget. But properly analyzed, it reveals trends, even patterns. Where will tornadoes touch down? When? Is there any indication of how strong the next one might be? Brown’s set suggests there is, along with a forecast for more twisters in the late afternoon in the central Great Plains states.
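To make that concrete, here is a minimal sketch, in Python, of the kind of aggregation that surfaces such a pattern. It assumes a hypothetical file of tornado records, tornadoes.csv, with illustrative column names (state and time); it is not Brown’s actual data set or code, only an illustration of counting where and when events cluster.

```python
import csv
from collections import Counter

# Hypothetical input: one row per recorded tornado, with columns
# "state" (two-letter code) and "time" (local time, "HH:MM").
# The file name and column names are illustrative only.
by_state = Counter()
by_hour = Counter()

with open("tornadoes.csv", newline="") as f:
    for row in csv.DictReader(f):
        by_state[row["state"]] += 1          # where tornadoes strike
        hour = int(row["time"].split(":")[0])
        by_hour[hour] += 1                   # when, by hour of day

# Grouping turns raw rows into information: the most affected states
# and the hours at which tornadoes peak.
print("Most affected states:", by_state.most_common(5))
print("Peak hours:", by_hour.most_common(3))
```

A few dozen lines like these begin to answer the where and when questions above; the harder step, as the DIKW pyramid suggests, is turning those counts into knowledge about what to do next.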
Every business can do the same with its own data, provided that data is analyzed properly. You make your money work for you in stocks, bonds, equipment, property and capital. Why not make your data work for you, too? Turn your big data into big analytics.