Feature Story

The Busy Executive’s Guide to New Data Tools

Enterprises need new ways to collect, store, and analyze increasingly rich data.

Today’s digital economy generates more data than ever, and tomorrow’s will generate even more. Indeed, experts generally agree that the size of the digital universe will continue to double every two years – in large part due to emerging technologies.

“Human- and machine-generated data is experiencing an overall 10x faster growth rate than traditional business data, and machine data is increasing even more rapidly at 50x the growth rate,” according to Inside Big Data.

This ongoing deluge of data provides tremendous opportunities to enterprises that can derive strategic and operational benefits from the data they collect from apps, mobile devices, the Internet, the Internet of Things (IoT), electronic transactions, and other emerging sources. Analytics leaders report significantly higher revenue growth than organizations that have failed to invest, according to Cisco's research (see the report Advanced Analytics: The Key to Becoming a Data-Driven Enterprise).

Real-time data can improve customer service and operational efficiency by providing immediate transparency, allowing systems and support personnel to deliver the appropriate information or services when they’re needed. Real-time data also is crucial to timely fraud detection and business continuity, enabling system performance issues to be addressed before they become costly outages.

But the digital data explosion also presents organizations with serious challenges. Not only must all data – much of which is unstructured or siloed – be extracted, scrubbed, secured, and managed, it must also be analyzed in real time or near real time to facilitate timely decisions or actions. And that's hard to do.

Fortunately, a growing number of tools and techniques are available to enterprises seeking to collect increasingly rich and varied streams of data and convert them into actionable information in near real time. Here are six such tools and techniques for enterprise CIOs to consider.

1. Data Lakes

Imagine a body of water containing a wide variety of life forms. A data lake is the digital version of this body of water, but instead of frogs and turtles, it contains data in various forms – either raw or structured, and typically in many formats (spreadsheets, emails, PDFs, multimedia, etc.).

This data can be fished out of the data lake by multiple parties or programs for multiple uses – customer service, analytics, routine reporting, systems monitoring, transaction processing, machine learning, and more.

For many organizations that need scalable storage and easy access to vast amounts of data, a data lake can make a lot of sense, in large part because it gets all the enterprise’s data out of separate silos and into one location.

However, just as an untended lake can become murky and overgrown, an unmanaged data lake can slow down and taint the analytics process, defeating the purpose of the data lake in the first place.

That means applying data management practices to data lakes. In particular, data residing in a data lake must be easily retrievable through queries. “That capability must be built in to the data lake through unique and rich metadata tags,” according to Data Insider. “Without these tags, the data lake quickly devolves into what industry insiders have dubbed the data swamp.”
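To make the role of metadata tags concrete, here is a minimal sketch of a tag-based catalog. It is an illustration only: the `LakeObject` and `Catalog` names, the file paths, and the tags are all hypothetical, and a real data lake would rely on a dedicated catalog service rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class LakeObject:
    path: str        # where the raw file sits in the lake (hypothetical paths)
    fmt: str         # file format, e.g. "email", "csv", "audio"
    tags: frozenset  # descriptive metadata tags attached at ingestion


class Catalog:
    """Hypothetical in-memory index of everything stored in the lake."""

    def __init__(self):
        self._objects = []

    def register(self, path, fmt, tags):
        # Tagging happens when data enters the lake, not when it is queried.
        self._objects.append(LakeObject(path, fmt, frozenset(tags)))

    def find(self, *required_tags):
        # Return the paths of objects that carry every requested tag.
        wanted = set(required_tags)
        return [obj.path for obj in self._objects if wanted <= obj.tags]


catalog = Catalog()
catalog.register("raw/support/ticket-001.eml", "email", {"customer-service", "2024"})
catalog.register("raw/finance/q1.csv", "csv", {"finance", "2024"})
catalog.register("raw/support/call-17.mp3", "audio", {"customer-service"})

support_files = catalog.find("customer-service")
print(support_files)
```

Without the tags, the only way to locate the two customer-service files would be to open and inspect every object in the lake, which is the "data swamp" scenario described above.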

Still, for organizations willing to invest in scalable storage infrastructure and management, data lakes can be a valuable resource.

2. Data Pipelining

Organizations collect vast amounts of data they want to be analyzed as quickly as possible. Rather than move all that data prior to analysis, it would be much easier and faster to perform analytics where the data is generated, right?

Not always. For starters, conducting data analysis at the point of collection puts a tremendous computing burden on the host systems, which can lead to degradation of both the data collection and analysis processes. Further, should an organization need to make changes in how (or what) data is stored, having a separate system can mitigate risk as these changes are made.

Data pipelining is an approach that encompasses all the steps necessary for moving and managing data – including making backup copies, migrating data to a cloud, reformatting, and merging the relocated data with other data. Data pipelines and data lakes are not mutually exclusive; data pipelines commonly are used to move data into data lakes.

A data pipeline is very similar to a manufacturing assembly line in that the job is broken down into multiple parts as the “product” moves through the system. While the components of a typical data pipeline are automated, IT pros inevitably must monitor, update, and troubleshoot the system to keep it running.
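The assembly-line idea can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the stage names (`extract`, `clean`, `enrich`, `load`), the sample records, and the list standing in for a data lake are all assumptions made for the example.

```python
def extract():
    # Stage 1: collect raw records (hard-coded, hypothetical sample data).
    return [
        {"name": " Ada ", "amount": "10.50"},
        {"name": "Grace", "amount": "n/a"},
        {"name": "Linus", "amount": "3"},
    ]


def clean(records):
    # Stage 2: reformat values and drop records that cannot be parsed.
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue
        yield {"name": record["name"].strip(), "amount": amount}


def enrich(records, regions):
    # Stage 3: merge the moving data with another data set.
    for record in records:
        yield {**record, "region": regions.get(record["name"], "unknown")}


def load(records, destination):
    # Stage 4: deliver the finished "product" to its destination.
    destination.extend(records)
    return destination


regions = {"Ada": "EMEA"}  # hypothetical lookup table
lake = []                  # stand-in for a data lake
load(enrich(clean(extract()), regions), lake)
print(lake)
```

Because each stage does one job and hands its output to the next, a stage can be monitored, updated, or replaced without rewriting the rest of the line.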

3. Improved Data Extraction Tools

The initial step in the data process – extraction – is complicated by the sheer volume of data and growing number of data formats and sources. Data extraction tools employ crawlers equipped with machine learning algorithms to automatically identify and collect structured and unstructured data from websites, devices, and other content sources such as emails, texts, scanned documents, spreadsheets, and PDF files. The variety of tools and capabilities continues to expand, giving CIOs more ways to approach the task.

Data extraction tools save time by quickly identifying and structuring information that can be analyzed to produce strategically or operationally important data. On the strategic side, these tools can locate and extract data from internal sources such as CRM platforms and server logs or external sources such as social media sites and online business directories.

Critically, because an increasing number of these tools employ machine learning algorithms, data extraction in a given company can become even more efficient over time.

4. Data Mitosis

Cloud computing has helped solve the problem of data storage for enterprises in recent years, virtually eliminating storage capacity problems – and at reasonable (and ever-falling) prices. Serverless computing continues that trend.

Cheap and plentiful storage allows enterprises to replicate data for placement in multiple tables that are optimized for specific queries, rather than running those queries on big data stores. Not surprisingly, this replication process – dubbed data mitosis after the biological process in which cells divide into identical entities – increases access speed to data, moving organizations closer to real-time analytics and decision-making. According to a Google Cloud Platform blog post, Google uses data mitosis across several critical services including Gmail and Google Maps.

As one might expect, data mitosis requires careful data synchronization to ensure that the latest, most accurate data is being accessed from everywhere it resides. This may entail continually replacing old data rows with new, or simply adding new rows.
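A minimal sketch of the idea, under simplifying assumptions: the `ReplicatedOrders` class and its sample orders are hypothetical, the "tables" are in-memory dictionaries, and synchronization is handled by the simpler of the two strategies mentioned above (adding each new row to every copy at write time).

```python
from collections import defaultdict


class ReplicatedOrders:
    """Hypothetical store that keeps one copy of each order per query pattern."""

    def __init__(self):
        self.orders = []                     # the original, general-purpose store
        self.by_customer = defaultdict(list)  # copy optimized for "orders for customer X"
        self.by_day = defaultdict(list)       # copy optimized for "orders on day Y"

    def add_order(self, customer, day, amount):
        row = (customer, day, amount)
        # Synchronization: every write goes to all copies together,
        # so no copy lags behind the others.
        self.orders.append(row)
        self.by_customer[customer].append(row)
        self.by_day[day].append(row)


store = ReplicatedOrders()
store.add_order("acme", "2024-05-01", 120.0)
store.add_order("acme", "2024-05-02", 30.0)
store.add_order("globex", "2024-05-02", 75.0)

# Each question is answered by a direct lookup instead of a scan of all orders.
print(store.by_customer["acme"])
print(store.by_day["2024-05-02"])
```

The trade-off is visible even at this scale: each order is stored three times, which is acceptable only because storage is cheap, and every write path has to update all three copies.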

5. Stream Processing

Another way to accelerate the analysis of large volumes of data is to use stream processing, in which queries are conducted on the fly, in real time, as data flows through servers. The processing of real-time streamed data is called streaming analytics.
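As a rough illustration of querying data "on the fly," the sketch below flags unusual readings as each one arrives rather than waiting for a scheduled batch. The `flag_spikes` function, the window size, the threshold, and the sample readings are all assumptions chosen for the example; real deployments typically use a dedicated stream processing platform.

```python
from collections import deque


def flag_spikes(readings, window=3, factor=2.0):
    """Yield any reading that exceeds `factor` times the recent average.

    Only the last `window` readings are held in memory, so the stream
    can be arbitrarily long.
    """
    recent = deque(maxlen=window)
    for value in readings:
        if len(recent) == window and value > factor * (sum(recent) / window):
            yield value  # act immediately, while the data is still flowing
        recent.append(value)


# Hypothetical stream of sensor or transaction values.
stream = iter([10, 11, 9, 10, 40, 12, 11])
spikes = list(flag_spikes(stream))
print(spikes)
```

The key difference from batch processing is that the decision about each reading is made the moment it arrives, using only a small window of recent history.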

Stream processing evolved from batch processing, in which large chunks of data are processed at scheduled times (such as after hours). But as the digital economy accelerated the pace of business, enterprises began running smaller batches of data more frequently out of competitive necessity.

At the same time, however, consumer and connected devices were fueling an explosion of data, making it even more difficult for batch processing to keep up with the real-time requirements of business.

Real-time processing of streamed data became the obvious solution.

Stream processing first was deployed in the finance industry, where immediate response to buying and selling opportunities is a matter of survival. But its use has spread to other industries and activities – such as fraud detection, e-commerce transactions, and network monitoring – that rely on real-time data analysis to make decisions and trigger actions.

Connected devices are another data source ideally suited for stream processing because they produce a continual flow of data in large volumes. As the Internet of Things (IoT) continues to proliferate, stream processing should only increase in popularity.

6. Metadata Management

When you’re dealing with unprecedented amounts of data that need to be processed as rapidly as possible, traditional data management won’t cut it. Enter metadata management, a practice that offers a unified view of data and enables enterprise decision-makers to get a handle on data governance, including compliance issues and operational efficiency.

Metadata management provides clarity about an enterprise’s data by better organizing and using information about that data. This is useful in a number of ways. For example, metadata reconciles different terms used across data sources for the same thing: One data repository may use the word “customer,” while another uses “clients.” For the purposes of analytics, it may be faster and more practical to reconcile that difference using metadata management rather than changing the applications producing the data.
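The “customer” versus “clients” case might look something like the following sketch. The mapping table, the `canonical` and `normalize` helpers, and the sample rows are hypothetical; the point is only that the reconciliation lives in metadata, while the source applications stay unchanged.

```python
# Hypothetical metadata: each source-specific term maps to one shared term.
CANONICAL_TERMS = {
    "customer": "customer",
    "client": "customer",
    "clients": "customer",
}


def canonical(field_name):
    # Fall back to the lowercased name when no mapping is recorded.
    key = field_name.lower()
    return CANONICAL_TERMS.get(key, key)


def normalize(record):
    # Rename fields at analysis time; the producing systems are untouched.
    return {canonical(name): value for name, value in record.items()}


crm_row = {"Customer": "Acme Corp", "region": "EMEA"}
billing_row = {"Clients": "Acme Corp", "balance": 1200}

print(normalize(crm_row))
print(normalize(billing_row))
```

After normalization, both rows expose the same `customer` field, so they can be joined or compared without touching either source system.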

Metadata also includes information about where data originated, as well as how data was accessed, altered, or deleted, and by whom or what (an application or process).

One of the most useful (and the most widely used) types is technical metadata, “a key metadata type used to build and maintain the enterprise data environment,” TechTarget writes. “Technical metadata typically includes database system names, table and column names and sizes, data types and allowed values, and structural information such as primary and foreign key attributes and indices.”
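Most databases can report this kind of technical metadata about themselves. As one small example, the sketch below uses SQLite (via Python's built-in `sqlite3` module) and its `PRAGMA table_info` statement to list column names, data types, and key attributes. The `customers` table is hypothetical and exists only for the illustration.

```python
import sqlite3

# Hypothetical table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, region TEXT)"
)

# PRAGMA table_info returns one row per column:
# (position, name, declared type, not-null flag, default value, primary-key flag)
columns = [
    {
        "column": name,
        "type": col_type,
        "required": bool(notnull),
        "primary_key": bool(pk),
    }
    for _cid, name, col_type, notnull, _default, pk in conn.execute(
        "PRAGMA table_info(customers)"
    )
]

for column in columns:
    print(column)
```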