Big Data: Bigger is Not Necessarily Better

Big data is the current trendy phrase that covers many different areas. Big data describes equally well having a huge volume of data generated in a short period of time (like molecular simulations of DNA), having a huge volume of data that needs to be indexed and archived (like PubMed or Web of Science), or wanting to analyze different types of data that weren't collected for a given purpose (the CI-BER project uses a variety of data types collected over the years to study a neighborhood in Asheville, NC).

The Virtual School of Computational Science and Engineering (VSCSE) exists to fill the gap in people's training. Many people facing big data situations were trained as physical scientists, life scientists, or social scientists, not as computer scientists. Even those trained as computer scientists, librarians, or archivists were not prepared to deal with this volume of data if that training was completed more than a couple of years ago. VSCSE sponsors workshops and short courses in targeted areas for motivated people who know they need more education but are not going to enroll in a formal college or graduate degree program.

The virtual part refers to the fact that these folks are extremely comfortable with technology. The Data Intensive Summer School held 8-10 July 2013, for example, was teleconferenced to more than 400 participants at 15 listed sites. I sat in a room at the National Center for Supercomputing Applications in Urbana, IL, while the speakers broadcast from their home sites (e.g., San Diego, Chicago, North Carolina) and questions and comments came from all the sites. As one speaker put it, it's kind of weird to speak to people you can't always see, but I'm OK with that.

Three main problems exist when dealing with big data:
storage, interaction, and analysis. 
The summer school covered all three aspects.  I will report here on things that may help others who did
not sit through the summer school. 
I do not have answers to all the questions I encountered, but I am going
to share some interesting questions that have made me think recently. 


Storage

Several of the talks touched on just the scope of having too
much data.  Even with terabyte
drives being cheap and readily available, the storage problems are still the
same as when I was in school: data can be generated far faster than one can
analyze it and will exceed local storage if one is not careful.  Even a petabyte can fill quickly with
multiple users who are each generating terabytes. 

In fact, the volume problem can be worse than it used to be because of the size of individual files, the number of files, or both. Ending up with several thousand files that are each tens, if not hundreds, of gigabytes is becoming a common problem for people who do the kinds of simulations that I do. Worse, those files are probably generated on one supercomputing cluster but need to be analyzed or stored somewhere else, because that is how things were done years ago. The system admins become downright rude about people filling up scratch space, but, in terms of researcher time management, leaving generated files on the scratch drives for a few weeks is often the best solution. Dropbox, cloud storage, and Google documents are great for people who only need to share a couple of reports or presentations, but they are not at all useful for researchers who are routinely dealing with thousands of files that are each huge.

The good folks in the Chicagoland area (Argonne, University of Chicago) have created Globus. Globus is a way to efficiently transfer files for supercomputer and similar cluster users that takes into account how much babysitting the big files need during a transfer. I have not used it, but I will try it the next time my data are being generated offsite and need to be transferred to storage.

 

Interaction

Several of the talks at the summer school described the difficulty of interacting with huge amounts of data. Automated searches are nice, but how are things indexed in the first place? What about data sets that consist of widely varying things that are hard to describe? For example, many databases handle formatted records well, but what about data that is unformatted, free-form, or in wildly different formats (not just Excel, plain text, and Kaleidagraph, but movies, pictures, maps, pointers to books or physical objects)? Limiting oneself to things that are easily described and indexed using traditional means leaves out perhaps as much as 70-85% of the data currently in electronic form.

I was particularly taken by one speaker who mentioned that much of the data that is easiest to deal with can be misleading. For example, a numerical poll may tell you much less about what people are thinking than the free-form responses. However, tabulating a million numerical poll results is computationally easy, while dealing with a thousand free-form written responses is not.
Searching something like PubMed by keyword is easy, if you know the
right keywords, but is extremely difficult if the words you want were not what
someone else considered important enough to index, especially if the records
are not searchable text.  The novel Mr. Penumbra's 24-Hour Bookstore has a very nice example of
the protagonist searching a database for the record of a physical object that
must be in storage somewhere and coming up empty because he can't figure out
the proper keywords.  I will not
share the spoiler on how the protagonist solved the problem by hacking the
carbon interface instead of the silicon interface, but it is very clever and a
good example of how computers are not as smart as people are even now.

I still have more questions than answers, because that is
the state of the field, but I look forward to trying some things for myself and
reading/talking with people who are pushing hard to find solutions for their
mixed huge data set problems.

 

Analysis

People who have data usually want to perform analysis to turn data into information. This is also far trickier than one might expect if one is doing anything other than simply scaling up an analysis to a bigger data set of coherently formatted records. Even simply scaling up an analysis can become tricky when machine memory will be exceeded or a program has a hard limit on the number of entries. For example, I have overwhelmed Excel by having too many rows of data to read in, and I am told limits exist for many other programs. Once people start talking about databases with millions of entries and tens of thousands of possible keys for every entry, the unwieldiness is evident. That is indeed big data without good solutions, and we haven't even gotten to what you do when a data set is not formatted in a way that can be automated and you aren't quite sure how to use it anyway.
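
As a rough illustration of working around such limits, here is a minimal sketch in R (my own example, not something presented at the summer school) that summarizes a CSV file too large to open comfortably by reading it in chunks; the file name and the column being summed are hypothetical.

    # Read a large CSV in 100,000-row chunks from an open connection so
    # the whole file never has to fit in memory at once.
    con <- file("huge_results.csv", open = "r")
    col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line
    total <- 0
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, col.names = col_names, nrows = 100000),
        error = function(e) NULL)  # read.csv errors once the file is exhausted
      if (is.null(chunk)) break
      total <- total + sum(chunk$value)  # "value" is a hypothetical column
    }
    close(con)
    total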

In the summer school, we were introduced to some functions in R that might help people plot and learn something. My favorite is get_map, which allows you to use data from Google Maps (a coherent description with examples is here). Another interesting tool was the Google motion chart, which I suspect is how Michael Marder makes his interesting plots on what matters for science education.
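
To make these concrete, here is a minimal sketch of both tools, assuming the ggmap package (which provides get_map) and the googleVis package (which provides the motion chart); the location and the bundled Fruits demo data are my own examples, not anything from the summer school.

    # Minimal sketch: get_map() from the ggmap package pulls map tiles
    # for a location (an arbitrary example here).
    library(ggmap)
    asheville <- get_map(location = "Asheville, NC", zoom = 13)
    ggmap(asheville)  # draw the map; add geom_point() layers for your data

    # Minimal sketch: a Google motion chart via the googleVis package,
    # using its bundled Fruits demo data; plot() opens it in a browser.
    library(googleVis)
    chart <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year")
    plot(chart)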

More flexible tools in R are in the ggplot2 package (a nice overview of the functionality is here). If you want slick-looking, complex plots, then this is for you. I am continuing to experiment with facet_grid because I often have multiple small plots to make in order to look at slices through my data by extra variables. Having a program that will do that for me saves a lot of time.
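
As an illustration (a toy example with R's built-in mtcars data, not one of my research data sets), facet_grid lays out one small plot per combination of two grouping variables:

    # One small scatter plot for each combination of number of cylinders
    # (rows of the grid) and number of gears (columns of the grid).
    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +
      facet_grid(cyl ~ gear)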

 

Summary

Big data is the latest buzzword that affects some of us (you do know about the requirement for a data management plan for your next NSF proposal, don't you?), but it is a real problem for people doing cutting-edge science. If you find yourself overwhelmed with your data, then I highly recommend that you attend a workshop or seminar series to share ideas, even with those who are far outside your field. It's time well spent.
