It used to be that if you wanted to run a big database, you got yourself a big computer. For a really big database, that meant a big mainframe or a cluster of hefty servers. But today, businesses and researchers alike are interested in vast collections of data that would swamp even a supercomputer and overwhelm any standard database management software.
Welcome to the world of big data. The exact definition of big data is a bit slippery, but Wikipedia does quite well: "Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set." Examples of such data sets range from billions of Google searches conducted by millions of users to the data collected by millions of weather sensors around the globe to all the purchases of British supermarket shoppers.
The amounts of data collected can be staggering. According to the report "Big Data: The Next Frontier for Innovation, Competition, and Productivity" by the McKinsey Global Institute, UK-based retailer Tesco collects 1.5 billion items of data on customer behavior every month while Facebook shares 30 billion user-produced items of content each month.
Big data enables analysis that is different in kind, not just in scale, from conventional database analysis by allowing analysts to discover information they did not know was in the data. For example, the tracking of disease outbreaks has long depended on the slow filing and compilation of reports by doctors and hospitals. But for the past couple of years, Google has been ahead of public health authorities in monitoring flu outbreaks by compiling public searches for flu-related information by geography. (See the diagram of flu search activity above.)
Another health application is gleaning useful treatment information from millions of patient medical records as these go electronic. Predictive Medical Technologies analyzes records of intensive care patients to detect patterns that might signal adverse events, such as cardiac arrest or arrhythmia. Once trends are identified, real-time monitoring of patients can spot similar patterns and give doctors critical early warning.
Big data raises new technical and privacy issues that must be dealt with for the technology to reach its potential. Traditional databases typically run on a single computer or a tightly integrated cluster of servers. Queries are run against the entire database in a fairly straightforward fashion, though tremendous effort can go into tweaking performance to the maximum.
Big data, by contrast, is often found on distributed systems that involve hundreds, thousands, or, in extreme cases such as Google, millions of servers that are often dispersed all over the globe, linked either by private networks or, more often, the public internet. Efficient processing requires high-bandwidth, low-latency network links, particularly if data are being used in anything approaching real time.
Analyzing the data also requires different software techniques. Probably the most important is MapReduce, a programming model developed (and patented) by Google for running queries across its vast network of servers. In rough terms, a "map" step runs on each server against the data stored there and emits intermediate results, and a "reduce" step then combines those results into a manageable dataset for analysis. Apache Hadoop, much of it developed at Yahoo!, is a widely used open-source framework that includes an implementation of MapReduce.
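To give a rough feel for the MapReduce idea, here is a minimal single-machine sketch in Python. It is an illustration under stated assumptions, not Google's or Hadoop's actual code: the search-log records, the region names, and the function names (map_phase, shuffle, reduce_phase, run_job) are all hypothetical, and the "servers" are simply lists held in memory. The job counts flu-related searches per region.

```python
from collections import defaultdict

# Hypothetical search-log records, split into shards as if each shard
# lived on a separate server. Each record is (region, query).
SHARDS = [
    [("London", "flu symptoms"), ("Leeds", "train times"), ("London", "flu remedy")],
    [("Leeds", "flu jab"), ("London", "weather"), ("Bristol", "flu symptoms")],
]


def map_phase(shard):
    """Emit a (region, 1) pair for every flu-related query in one shard."""
    return [(region, 1) for region, query in shard if "flu" in query]


def shuffle(mapped_outputs):
    """Group the intermediate values by key across all shards."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(shards):
    # In a real cluster the map step would run in parallel, one task per shard;
    # here it is a plain loop for illustration.
    mapped = [map_phase(shard) for shard in shards]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(SHARDS))
```

The point of the split is that the map step only ever touches the data held locally on one machine, so only the small intermediate pairs need to cross the network before being combined.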
Managing the privacy implications of big data may be more difficult. When enough data about an individual is collected, it may become possible to identify a person uniquely even though none of the information is classed as "personally identifiable," a process known as "de-anonymization." So far, this threat is largely theoretical, but fears about the privacy implications of big data will have to be addressed if the technology is to reach its full potential.
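The mechanics behind de-anonymization can be sketched in a few lines of Python. This is a toy illustration, not a real attack: the records, the field names (postcode, birth_year, sex), and the helper unique_fraction are all made up for the example. It measures what share of records become one of a kind as more seemingly harmless fields are combined.

```python
from collections import Counter

# Hypothetical "anonymized" records: no names, only innocuous-looking attributes.
RECORDS = [
    {"postcode": "SW1A", "birth_year": 1975, "sex": "F"},
    {"postcode": "SW1A", "birth_year": 1975, "sex": "M"},
    {"postcode": "SW1A", "birth_year": 1980, "sex": "F"},
    {"postcode": "LS1", "birth_year": 1975, "sex": "F"},
    {"postcode": "LS1", "birth_year": 1975, "sex": "F"},
    {"postcode": "LS1", "birth_year": 1980, "sex": "M"},
]


def unique_fraction(records, fields):
    """Return the share of records whose combination of `fields` is unique."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


if __name__ == "__main__":
    for fields in (["postcode"],
                   ["postcode", "birth_year"],
                   ["postcode", "birth_year", "sex"]):
        print(fields, round(unique_fraction(RECORDS, fields), 2))
```

In this made-up data set, no record is unique by postcode alone, a third are unique once birth year is added, and two thirds are unique with all three fields. Anyone holding a second data set that links those same fields to names could then re-identify the unique records.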
The contents or opinions in this feature are independent and do not necessarily represent the views of Cisco. They are offered in an effort to encourage continuing conversations on a broad range of innovative technology subjects. We welcome your comments and engagement.