The Real Limitations of Big Data

“Every revolution in science—from the Copernican heliocentric model to the rise of statistical and quantum mechanics, from Darwin’s theory of evolution and natural selection to the theory of the gene—has been driven by one and only one thing: access to data.”

That was the eye-opening start of a keynote address given yesterday by the brilliant John Quackenbush, a professor of biostatistics and computational biology at Dana-Farber Cancer Institute who has a dual professorship at the Harvard T.H. Chan School of Public Health and ample other academic credits after his name.

There is also no question that this digital fuel is driving virtually every transformation in healthcare happening today. Speaking at the MedCity Converge conference in Philadelphia, Quackenbush noted that the average hospital is generating roughly 665 terabytes of data annually, with some four-fifths of it in the unstructured forms of images, video, and doctors’ notes.

But the great limiting factor in harnessing all of this information-feedstock is not a “big data problem,” but rather a “messy data problem.”

In sum, where there is plenty of potentially useful data to examine, we don’t make it accessible in ways that people actually want to use. Either the data isn’t easy or intuitive to access or it simply isn’t informative. Or it’s in the wrong format. Or it’s incomplete, or created with incompatible “standards” (of which we seem to have an unlimited, irreconcilable supply). Or it captures just one dimension of a multidimensional realm. (“Biological systems are really complex, adaptive systems with many moving parts, that we’ve only begun to scratch the surface of understanding,” he says.)

Or—and this one seems to be a surprisingly common misstep—the data doesn’t really address the question the end user wants to answer. It’s off-purpose, in other words.

Take the case of population-level data, which government and academic institutions routinely collect: “Statistics operate on population data and medical research is driven by population data,” says Quackenbush, “but medical care is driven by individual-level data. So when we’re driving [our data research] to the clinic, we have to think about how we’re going to make that individual-level [data] available in a meaningful format.”

Ultimately, the goal, he says, should be to “create intuitive graphical representations of the underlying data” in ways that allow non-data scientists “to explore it without having to sit at a terminal and type in a bunch of obscure commands.”

“What you want to think about doing when you make data available to people is to create interfaces that allow them to dive in and make sense of that data, using their own intuition,” Quackenbush says.

Without doing that, all of our growing mounds of big data will simply be big blobs on ever-bigger data servers.

What’s to stop that from happening? The incentive for turning all this raw feedstock into a usable fuel “is not going to be enhancing healthcare or making people better,” Quackenbush says flatly. “The driver is really going to be the most important ‘–omics’ science of all: which is economics. We have to show that there’s an advantage to bringing this kind of data and information together if we’re really going to make advances.”

By Clifton Leaf. This essay appeared in the August 2 edition of the Fortune Brainstorm Health Daily.