Friday, December 2

Big Data


Several weeks ago, JD phoned to ask how my research was coming along and whether I had any large amounts of data arriving soon.  His timing was impeccable; genetic work generates far more data than more "traditional" physiological experimentation.  JD, sounding excited, asked how many data points that would involve.  So I explained that each array plate houses information on 84 genes for 4 treatment groups, and that I have 4 biological replicates and 3 different arrays.  That is 4032 data points and, in my opinion, quite a lot.  JD was disappointed and explained that this could be handled by a single computer in a matter of milliseconds.  For a school project, he was after data that would take a hundred computers hours to handle (or some outrageous amount of computing power like that).  All of my school projects, perhaps with the exception of my Lyman Briggs senior thesis, pale in comparison.
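
For the curious, the count above can be checked with a few lines of Python.  This is only a sketch of the arithmetic: the variable names are illustrative labels, and it assumes each gene is measured once per treatment group, replicate, and array.

```python
# Counts taken from the post; the names are illustrative labels.
genes_per_plate = 84
treatment_groups = 4
biological_replicates = 4
arrays = 3

# One measurement per gene, per treatment group, per replicate, per array.
total_data_points = (
    genes_per_plate * treatment_groups * biological_replicates * arrays
)
print(total_data_points)  # 4032
```

A loop over four thousand values like this finishes almost instantly on an ordinary laptop, which is consistent with JD's reaction.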

I was reminded how minuscule my data-crunching problems may be by a story on Morning Edition earlier this week: "The Search for Analysts to Make Sense of 'Big Data'."


The story focuses on the venture capital/business side of the problem, but big data is also a growing problem in the world of biomedical research, as investigations shift towards high-throughput, 'omic' scale studies.  Issues with data have been well-documented in the realm of genomics: earlier this week they were discussed by the New York Times and by the computing news site HPCWire.  But there are also the developing fields of proteomics, metabolomics, and interactomics.  Essentially, each of these fields seeks to study its particular focus in a more comprehensive way than ever before.  Proteomics seeks to take a sample and analyze all the protein parts of that sample more or less simultaneously.  Metabolomics seeks to analyze all changes across space and time in small molecule metabolites (although the strict definition of which small molecules qualify is debated) -- which some argue is the truest method of assessing functional biological changes, since changes to DNA, proteins, and protein processing ultimately result in changes in cellular function which can be observed.  Interactomics seeks to address how everything within a cell, which scientists have gotten quite skilled at deconstructing, can be brought back together into a spatially and temporally relevant understanding of organismal function.

Phew...it's intense!  And as one might imagine, such quests for knowledge generate massive amounts of data, and these require storage, analysis and interpretation.  These are the kinds of things, generating megabytes and terabytes of information, which can be handled by the "supercomputers" JD is working with.  And the result is as many publications on the computational/bioinformatic side of these fields as there are publications on the biology itself.  

We are at the point where how we handle data is as important as the data itself.  How should these data be stored?  How can they be stored securely if and when they include privileged patient information?  How should data be analyzed?  How can such data sets be stored/distributed for future or further investigation?

For me, that's one really spectacular thing about science: the more we learn, the more ways we find to learn more, and the more we adapt and develop technology to permit this learning in the first place.  And although I think my dissertation will take long enough, it is sometimes incredible to think that just a few years ago my project would have been impossible to complete within the 7-year graduation limit.  Five to ten years from now, what I will complete in three and a half years may only require a few months.
