Tapping into the power of Big Data

Transcripción

Technologyforecast
Making sense
of Big Data
A quarterly journal
2010, Issue 3
In this issue
04
22
36
Tapping into the
power of Big Data
Building a bridge
to the rest of your data
Revising the CIO’s
data playbook
Contents
Features
04
Treating it differently from your core enterprise data is essential.
22
Building a bridge to the rest of your data
How companies are using open-source cluster-computing
techniques to analyze their data.
36
Revising the CIO’s data playbook
Start by adopting a fresh mind-set, grooming the right talent,
and piloting new tools to ride the next wave of innovation.
Interviews
14
The data scalability challenge
John Parkinson of TransUnion describes the data handling issues
more companies will face in three to five years.
18
Creating a cost-effective Big Data strategy
Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an
agile approach that leverages open-source and cloud technologies.
34
Hadoop’s foray into the enterprise
Cloudera’s Amr Awadallah discusses how and why diverse
companies are trying this novel approach.
46
New approaches to customer data analysis
Razorfish’s Mark Taylor and Ray Velez discuss how new techniques
enable them to better analyze petabytes of Web data.
Departments
02
Message from the editor
50
Acknowledgments
54
Subtext
Message from
the editor
Bill James has loved baseball statistics ever since he was a kid in Mayetta,
Kansas, cutting baseball cards out of the backs of cereal boxes in the early
1960s. James, who compiled The Bill James Baseball Abstract for years, is
a renowned “sabermetrician” (a term he coined himself). He now is a senior
advisor on baseball operations for the Boston Red Sox, and he previously
worked in a similar capacity for other Major League Baseball teams.
James has done more to change the world of baseball statistics than
anyone in recent memory. As broadcaster Bob Costas says, James
“doesn’t just understand information. He has shown people a different
way of interpreting that information.” Before Bill James, Major League
Baseball teams all relied on long-held assumptions about how games are
won. They assumed batting average, for example, had more importance
than it actually does.
James challenged these assumptions. He asked critical questions that
didn’t have good answers at the time, and he did the research and analysis
necessary to find better answers. For instance, how many days’ rest does
a reliever need? James’s answer is that some relievers can pitch well for
two or more consecutive days, while others do better with a day or two of
rest in between. It depends on the individual. Why can’t a closer work more
than just the ninth inning? A closer is frequently the best reliever on the
team. James observes that managers often don’t use the best relievers
to their maximum potential.
The lesson learned from the Bill James example is that the best statistics
come from striving to ask the best questions and trying to get answers to
those questions. But what are the best questions? James takes an iterative
approach, analyzing the data he has, or can gather, asking some questions
based on that analysis, and then looking for the answers. He doesn’t stop
with just one set of statistics. The first set suggests some questions, to
which a second set suggests some answers, which then give rise to yet
another set of questions. It’s a continual process of investigation, one that’s
focused on surfacing the best questions rather than assuming those
questions have already been asked.
Enterprises can take advantage of a similarly iterative, investigative
approach to data. Enterprises are being overwhelmed with data; many
enterprises each generate petabytes of information they aren’t making best
use of. And not all of the data is the same. Some of it has value, and some,
not so much.
The problem with this data has been twofold: (1) it’s difficult to analyze,
and (2) processing it using conventional systems takes too long and is
too expensive.
02
PricewaterhouseCoopers Technology Forecast
Addressing these problems effectively doesn’t require
radically new technology. Better architectural design
choices and software that allows a different approach
to the problems are enough. Search engine companies
such as Google and Yahoo provide a pragmatic way
forward in this respect. They’ve demonstrated that
efficient, cost-effective, system-level design can lead
to an architecture that allows any company to handle
different data differently.
“Revising the CIO’s data playbook,” on page 36
emphasizes that CIOs have time to pick and choose
the most suitable approach. The most promising
opportunity is in the area of “gray data,” or data that
comes from a variety of sources. This data is often raw
and unvalidated, arrives in huge quantities, and doesn’t
yet have established value. Gray data analysis requires
a different skill set—people who are more exploratory
by nature.
Enterprises shouldn’t treat voluminous, mostly
unstructured information (for example, Web server
log files) the same way they treat the data in core
transactional systems. Instead, they can use commodity
computer clusters, open-source software, and Tier 3
storage, and they can process in an exploratory way
the less-structured kinds of data they’re generating.
With this approach, they can do what Bill James does
and find better questions to ask.
As always, in this issue we’ve included interviews with
knowledgeable executives who have insights on the
overall topic of interest:
In this issue of the Technology Forecast, we review the
techniques behind low-cost distributed computing that
have led companies to explore more of their data in new
ways. In the article, “Tapping into the power of Big
Data,” on page 04, we begin with a consideration of
exploratory analytics—methods that are separate from
traditional business intelligence (BI). These techniques
make it feasible to look for more haystacks, rather than
just the needle in one haystack.
• Amr Awadallah of Cloudera explores the reasons
behind Apache Hadoop’s adoption at search engine,
social media, and financial services companies.
The article, “Building a bridge to the rest of your data,”
on page 22 highlights the growing interest in and
adoption of Hadoop clusters. Hadoop provides highvolume, low-cost computing with the help of opensource software and hundreds or thousands of
commodity servers. It also offers a simplified
approach to processing more complex data in parallel.
The methods, cost advantages, and scalability of
Hadoop-style cluster computing clear a path for
enterprises to analyze lots of data they didn’t have
the means to analyze before.
The buzz around Big Data and “cloud storage” (a term
some vendors use to describe less-expensive clustercomputing techniques) is considerable, but the article,
Message from the editor
• John Parkinson of TransUnion describes the data
challenges that more and more companies will face
during the next three to five years.
• Bud Albers, Scott Thompson, and Matt Estes of Disney
outline an agile, open-source cloud data vision.
• Mark Taylor and Ray Velez of Razorfish contrast
newer, more scalable techniques of studying
customer data with the old methods.
Please visit pwc.com/techforecast to find these articles
and other issues of the Technology Forecast online.
If you would like to receive future issues of the
Technology Forecast as a PDF attachment, you
can sign up at pwc.com/techforecast/subscribe.
We welcome your feedback and your ideas for future
research and analysis topics to cover.
Tom DeGarmo
Principal
Technology Leader
[email protected]
03
Tapping into the
power of Big Data
Treating it differently from your core enterprise data is essential.
By Galen Gruman
04
Like most corporations, the Walt Disney Co. is
swimming in a rising sea of Big Data: information
collected from business operations, customers,
transactions, and the like; unstructured information
created by social media and other Web repositories,
including the Disney home page itself and sites for
its theme parks, movies, books, and music; plus the
sites of its many big business units, including ESPN
and ABC.
“In any given year, we probably generate more data
than the Walt Disney Co. did in its first 80 years of
existence,” observes Bud Albers, executive vice
president and CTO of the Disney Technology Shared
Services Group. “The challenge becomes what do
you do with it all?”
Albers and his team are in the early stages of
answering their own question with an economical
cluster-computing architecture based on a set of
cost-effective and scalable technologies anchored
by Apache Hadoop, an open-source, Java-based
distributed file system based on Google File System
and developed by Apache Software Foundation.
These still-emerging technologies allow Disney analysts
to explore multiple terabytes of information without the
lengthy time requirements or high cost of traditional
business intelligence (BI) systems.
platform make it feasible not only to look for the
needle in the haystack, but also to look for new
haystacks. This kind of analysis demands an attitude
of exploration—and the ability to generate value from
data that hasn’t been scrubbed or fully modeled into
relational tables.
Using Disney and other examples, this first article
introduces the idea of exploratory BI for Big Data.
The second article examines Hadoop clusters and
technologies that support them (page 22), and the
third article looks at steps CIOs can take now to exploit
the future benefits (page 36). We begin with a closer
look at Disney’s still-nascent but illustrative effort.
“In any given year, we probably
generate more data than the
Walt Disney Co. did in its first
80 years of existence.”
—Bud Albers of Disney
This issue of the Technology Forecast examines how
Apache Hadoop and these related technologies can
derive business value from Big Data by supporting a
new kind of exploratory analytics unlike traditional BI.
These software technologies and their hardware cluster
05
Bringing Big Data under control
Big Data is not a precise term; rather, it is a
characterization of the never-ending accumulation of
all kinds of data, most of it unstructured. It describes
data sets that are growing exponentially and that are
too large, too raw, or too unstructured for analysis
using relational database techniques. Whether
terabytes or petabytes, the precise amount is less the
issue than where the data ends up and how it is used.
Like everyone else, Disney’s Big Data is huge, more
unstructured than structured, and growing much
faster than transactional data.
The Disney Technology Shared Services Group, which
is responsible for Disney’s core Web and analysis
technologies, recently began its Big Data efforts but
already sees high potential. The group is testing the
technology and working with analysts in Disney
business units. Disney’s data comes from varied
sources, but much of it is collected for departmental
business purposes and not yet widely shared. Disney’s
Big Data approach will allow it to look at diverse data
sets for unplanned purposes and to uncover patterns
across customer activities. For example, insights from
Disney Store activities could be useful in call centers
for theme park booking or to better understand the
audience segments of one of its cable networks.
The Technology Shared Services Group is even using
Big Data approaches to explore its own IT questions
to understand what data is being stored, how it is
used, and thus what type of storage hardware and
management the group needs.
Albers assumes that Big Data analysis is destined to
become essential. “The speed of business these days
and the amount of data that we are now swimming
in mean that we need to have new ways and new
techniques of getting at the data, finding out what’s in
there, and figuring out how we deal with it,” he says.
The team stumbled upon an inexpensive way to
improve the business while pursuing more IT costeffectiveness through the use of private-cloud
technologies. (See the Technology Forecast, Summer
2009, for more on the topic of cloud computing.) When
Albers launched the effort to change the division’s cost
curve so IT expenses would rise more slowly than the
business usage of IT—the opposite had been true—he
turned to an approach that many companies use to
make data centers more efficient: virtualization.
Virtualization offers several benefits, including higher
utilization of existing servers and the ability to move
workloads to prevent resource bottlenecks. An
organization can also move workloads to external cloud
providers, using them as a backup resource when
needed, an approach called cloud bursting. By using
such approaches, the Disney Technology Shared
Services Group lowered its IT expense growth rate from
27 percent to –3 percent, while increasing its annual
processing growth from 17 percent to 45 percent.
While achieving this efficiency, the team realized that
the ability to move resources and tap external ones
could apply to more than just data center efficiency. At
first, they explored using external clouds to analyze big
sets of data, such as Web traffic to Disney’s many sites,
and to handle big processing jobs more cost-effectively
and more quickly than with internal systems.
During that exploration, the team discovered Hadoop,
MapReduce, and other open-source technologies that
distribute data-analysis workloads across many
computers, breaking the analysis into many parallel
workloads that produce results faster. Faster results
mean that more questions can be asked, and the low
cost of the technologies means the team can afford
to ask those questions.
Disney assembled a Hadoop cluster and set up
a central logging service to mine data that the
organization hadn’t been able to before. It will begin
to provide internal group access to the cluster in
October 2010. Figure 1 shows how the Hadoop
cluster will benefit internal groups, business partners,
and customers.
“The speed of business these days and the amount of data that we
are now swimming in mean that we need to have new ways and new
techniques of getting at the data, finding out what’s in there, and figuring
out how we deal with it.” —Bud Albers of Disney
06
Improved
experience
4
Internal
business
partners
Site
visitors
Affiliated
businesses
Interface to cluster
(MapReduce/Hive/Pig)
1
Usage
data
D-Cloud data cluster
2
Central
logging
service
Core IT and
business unit
systems
3
Hadoop
Metadata
repository
Figure 1: Disney’s Hadoop cluster and central logging service
Disney’s new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central
logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more
responsive and personalized user experience.
Source: Disney, 2010
07
Simply put, the low cost of a Hadoop cluster means
freedom to experiment. Disney uses a couple of dozen
servers that were scheduled to be retired, and the
organization operates its cluster with a handful of
existing staff. Matt Estes, principal data architect
for the Disney Technology Shared Services Group,
estimates the cost of the project at $300,000
to $500,000.
Opportunities for Big Data insights
Here are other examples of the kinds of insights
that may be gleaned from analysis of Big Data
information flows:
• Customer churn, based on analysis of call center,
help desk, and Web site traffic patterns
“Before, I would have needed to figure on spending $3
million to $5 million for such an initiative,” Albers says.
“Now I can do this without charging to the bottom line.”
• Changes in corporate reputation and the potential
for regulatory action, based on the monitoring of
social networks as well as Web news sites
Unlike the reusable canned queries in typical BI systems,
Big Data analysis does require more effort to write the
queries and the data-parsing code for what are often
unique inquiries of data sources. But Albers notes that
“the risk is lower due to all the other costs being lower.”
Failure is inexpensive, so analysts are more willing to
explore questions they would otherwise avoid.
• Real-time demand forecasting, based on
disparate inputs such as weather forecasts,
travel reservations, automotive traffic, and
retail point-of-sale data
Even in this early stage, Albers is confident that the
ability to ask more questions will lead to more insights
that translate to both the bottom line and the top line.
For example, Disney already is seeking to boost
customer engagement and spending by making
recommendations to customers based on pattern
analysis of their online behavior.
How Big Data analysis is different
What should other enterprises anticipate from Hadoopstyle analytics? It is a type of exploratory BI they haven’t
done much before. This is business intelligence that
provides indications, not absolute conclusions. It requires
a different mind-set, one that begins with exploration, the
results of which create hypotheses that are tested before
moving on to validation and consolidation.
These methods could be used to answer questions
such as, “What indicators might there be that predate
a surge in Web traffic?” or “What fabrics and colors
are gaining popularity among influencers, and what
sources might be able to provide the materials to us?”
or “What’s the value of an influencer on Web traffic
through his or her social network?” See the sidebar
“Opportunities for Big Data insights” for more examples
of the kinds of questions that can be asked of Big Data.
08
• Supply chain optimization, based on analysis of
weather patterns, potential disaster scenarios,
and political turmoil
Disney and others explore their data without a lot of
preconceptions. They know the results won’t be as
specific as a profit-margin calculation or a drug-efficacy
determination. But they still expect demonstrable value,
and they expect to get it without a lot of extra expense.
Typical BI uses data from transactional and other
relational database management systems (RDBMSs)
that an enterprise collects—such as sales and
purchasing records, product development costs, and
new employee hire records—diligently scrubs the data
for accuracy and consistency, and then puts it into a
form the BI system is programmed to run queries
against. Such systems are vital for accurate analyses
of transactional information, especially information
subject to compliance requirements, but they don’t
work well for messy questions, they’ve been too
expensive for questions you’re not sure there’s any
value in asking, and they haven’t been able to scale
to analyze large data sets efficiently. (See Figure 2.)
Large
data sets
Small
data sets
Big Data (via
Hadoop/MapReduce)
Little analytical value
Non-relational data
Less scalability
Traditional BI
Relational data
Figure 2: Where Big Data fits in
Other companies have also tapped into the excitement
brewing over Big Data technologies. Several Weboriented companies that have always dealt with huge
amounts of data—such as Yahoo, Twitter, and
Google—were early adopters. Now, more traditional
companies—such as TransUnion, a credit rating
service—are exploring Big Data concepts,
having seen the cost and scalability benefits
the Web companies have realized.
Specifically, enterprises are also motivated by the
inability to scale their existing approach for working
on traditional analytics tasks, such as querying
across terabytes of relational data. They are learning
that the tools associated with Hadoop are uniquely
positioned to explore data that has been sitting on
the side, unanalyzed. Figure 3 illustrates how the data
architecture landscape appears in 2010. Enterprises
with high processing power requirements and
centralized architectures are facing scaling issues.
Source: PricewaterhouseCoopers, 2010
In contrast, Big Data techniques allow you to sift through
data to look for patterns at a much lower cost and in
much less time than traditional BI systems. Should the
data end up being so valuable that it requires the
ongoing, compliance-oriented analysis of regular BI
systems, only then do you make that investment.
Big Data approaches let you ask more questions of
more information, opening a wide range of potential
insights you couldn’t afford to consider in the past.
“Part of the analytics role is to challenge assumptions,”
Estes says. BI systems aren’t designed to do that;
instead, they’re designed to dig deeper into known
questions and look for variations that may indicate
deviations from expected outcomes.
Furthermore, Big Data analysis is usually iterative: you
ask one question or examine one data set, then think of
more questions or decide to look at more data. That’s
different from the “single source of truth” approach to
standard BI and data warehousing. The Disney team
started with making sure they could expose and
access the data, then moved to iterative refinement in
working with the data. “We aggressively got in to find
the direction and the base. Then we began to iterate
rather than try to do a Big Bang,” Albers says.
High
processing
power
Low
processing
power
Enterprises facing
scaling and
capacity/cost
problems
Google, Amazon,
Facebook, Twitter,
etc. (all use nonrelational data stores
for reasons of scale)
Most enterprises
Cloud users with
low compute
requirements
Centralized
compute
architecture
Distributed
compute
architecture
Figure 3: The data architecture landscape in 2010
Wolfram Research and IBM have begun to extend
their analytics applications to run on such large-scale
data pools, and startups are presenting approaches
they promise will allow data exploration in ways that
technologies couldn’t have enabled in the past,
including support for tools that let knowledge workers
examine traditional databases using Big Data–style
exploratory tools.
09
The ways different enterprises approach Big Data
It should come as no surprise that organizations
dealing with lots of data are already investigating Big
Data technologies, or that they have mixed opinions
about these tools.
“At TransUnion, we spend a lot of our time trawling
through tens or hundreds of billions of rows of data,
looking for things that match a pattern approximately,”
says John Parkinson, TransUnion’s acting CTO. “We
want to do accurate but approximate matching and
categorization in very large low-structure data sets.”
Parkinson has explored Big Data technologies such
as MapReduce that appear to have a more efficient
filtering model than some of the pattern-matching
algorithms TransUnion has tried in the past.
“MapReduce also, at least in its theoretical formulation,
is very amenable to highly parallelized execution,”
which lets the users tap into farms of commodity
hardware for fast, inexpensive processing, he notes.
However, Parkinson thinks Hadoop and MapReduce
are too immature. “MapReduce really hasn’t evolved
yet to the point where your average enterprise
technologist can easily make productive use of it. As
for Hadoop, they have done a good job, but it’s like a
lot of open-source software—80 percent done. There
were limits in the code that broke the stack well before
what we thought was a good theoretical limit.”
Parkinson echoes many IT executives who are
skeptical of open-source software in general. “If I have
a bunch of engineers, I don’t want them spending their
day being the technology support environment for what
should be a product in our architecture,” he says.
That’s a legitimate point of view, especially considering
the data volumes TransUnion manages—8 petabytes
from 83,000 sources in 4,000 formats and growing—
and its focus on mission-critical capabilities for this
data. Credit scoring must run successfully and deliver
top-notch credit scores several times a day. It’s an
operational system that many depend on for critical
business decisions that happen millions of times a
day. (For more on TransUnion, see the interview with
Parkinson on page 14.)
10
Disney’s system is purely intended for exploratory
efforts or at most for reporting that eventually may feed
up to product strategy or Web site design decisions.
If it breaks or needs a little retooling, there’s no crisis.
But Albers disagrees about the readiness of the tools,
noting that the Disney Technology Shared Services
Group also handles quite a bit of data. He figures
Hadoop and MapReduce aren’t any worse than a lot of
proprietary software. “I fully expect we will run on things
that break,” he says, adding facetiously, “Not that any
commercial product I’ve ever had has ever broken.”
Data architect Estes also sees responsiveness in
open-source development that’s laudable. “In our
testing, we uncovered stuff, and you get somebody
on the other end. This is their baby, right? I mean,
they want it fixed.”
Albers emphasizes the total cost-effectiveness of
Hadoop and MapReduce. “My software cost is zero.
You still have the implementation, but that’s a constant
at some level, no matter what. Now you probably need
to have a little higher skill level at this stage of the
game, so you’re probably paying a little more, but
certainly, you’re not going out and approving a Teradata
cluster. You’re talking about Tier 3 storage. You’re
talking about a very low level of cost for the storage.”
Albers’ points are also valid. PricewaterhouseCoopers
predicts these open-source tools will be solid sooner
rather than later, and are already worthy of use in
non-mission-critical environments and applications.
Hence, in the CIO article on page 36, we argue in favor
of taking cautious but exploratory steps.
Asking new business questions
Saving money is certainly a big reward, but
PricewaterhouseCoopers contends the biggest
payoff from Hadoop-style analysis of Big Data is the
potential to improve organizations’ top line. “There’s
a lot of potential value in the unstructured data in
organizations, and people are starting to look at it
more seriously,” says Tom Urquhart, chief architect
at PricewaterhouseCoopers. Think of it as a “Google
in a box, which allows you to do intelligent search
regardless of whether the underlying content is
structured or unstructured,” he says.
The Google-style techniques in Hadoop, MapReduce,
and related technologies work in a fundamentally
different way from traditional BI systems, which use
strictly formatted data cubes pulling information from
data warehouses. Big Data tools let you work with data
that hasn’t been formally modeled by data architects,
so you can analyze and compare data of different types
and of different levels of rigor. Because these tools
typically don’t discard or change the source data
before the analysis begins, the original context
remains available for drill-down by analysts.
These tools provide technology assistance to a very
human form of analysis: looking at the world as it is
and finding patterns of similarity and difference, then
going deeper into the areas of interest judged valuable.
In contrast, BI systems know what questions should be
asked and what answers to expect; their goal is to look
for deviations from the norm or changes in standard
patterns deemed important to track (such as changes
in baseline quality or in sales rates in specific
geographies). Such an approach, absent an exploratory
phase, results in a lot of information loss during data
consolidation. (See Figure 4.)
Pattern analysis mashup services
There’s another use of Big Data that combines
efficiency and exploratory benefits: on-the-fly pattern
analysis from disparate sources to return real-time
results. Amazon.com pioneered Big Data–based
product recommendations by analyzing customer data,
including purchase histories, product ratings, and
comments. Albers is looking for similar value that
would come from making live recommendations to
customers when they go to a Disney site, store, or
reservations phone line—based on their previous
online and offline behavior with Disney.
O’Reilly Media, a publisher best known for technical
books and Web sites, is working with the White House
to develop mashup applications that look at data from
various sources to identify patterns that might help
lobbyists and policymakers. For example, by mashing
together US Census data and labor statistics, they can
see which counties have the most international and
domestic immigration, then correlate those attributes
with government spending changes, says Roger
Magoulas, O’Reilly’s research director.
Exploration
Pre-consolidated data (never collected)
All collected data
tio
n
Insight
da
Information
loss
oli
ns
on
ati
lid
o
ns
Summary
enterprise
data
Information
loss
Co
Co
Summary
departmental
data
All collected data
Summary
departmental
data
Less information loss
Information
loss
Summary
enterprise
data
Greater insight
Figure 4: Information loss in the data consolidation process
11
Mashups like this can also result in customer-facing
services. FlightCaster for iPhone and BlackBerry uses
Big Data approaches to analyze flight-delay records
and current conditions to issue flight-delay predictions
to travelers.
Exploiting the power of human analysis
Big Data approaches can lower processing and storage
costs, but we believe their main value is to perform the
analysis that BI systems weren’t designed for, acting as
an enabler and an amplifier of human analysis.
Ad hoc exploration at a bargain
Big Data lets you inexpensively explore questions
and peruse data for patterns that may indicate
opportunities or issues. In this arena, failure is cheap,
so analysts are more willing to explore questions they
would otherwise avoid. And that should lead to insights
that help the business operate better.
Medical data is an example of the potential for ad hoc
analysis. “A number of such discoveries are made on
the weekends when the people looking at the data
are doing it from the point of view of just playing
around,” says Doug Lenat, founder and CEO of
Cycorp and a former professor at Stanford and
Carnegie Mellon universities.
Right now the technical knowledge required to use
these tools is nontrivial. Imagine the value of extending
the exploratory capability more broadly. Cycorp is
one of many startups trying to make Big Data analytic
capabilities usable by more knowledge workers so
they can perform such exploration.
Analyzing data that wasn’t designed for BI
Big Data also lets you work with “gray data,” or data
from multiple sources that isn’t formatted or vetted for
your specific needs, and that varies significantly in its
level of detail and accuracy—and thus cannot be
examined by BI systems.
One analogy is Wikipedia. Everyone knows its
information is not rigorously managed or necessarily
accurate; nonetheless, Wikipedia is a good first place
to look for indicators of what may be true and useful.
From there, you do further research using a mix of
12
information resources whose accuracy and
completeness may be more established.
People use their knowledge and experience to
appropriately weigh and correlate what they find across
gray data to come up with improved strategies to aid
the business. Figure 5 compares gray data and more
normalized black data.
Black data
Classified
Provenanced
Cleaned
Actual
Gray data
Raw
Data and context commingled
Noisy
Hypothetical
e.g., Financial system data
e.g., Wikipedia
Reviewed
Confirming
More trustworthy
Managed by IT
Unchecked
Indicative
Less trustworthy
Managed by business unit
Figure 5: Gray versus black data
Web analytics and financial risk analysis are two
examples of how Big Data approaches augment human
analysts. These techniques comb huge data sets of
information collected for specific purposes (such as
monitoring individual financial records), looking for
patterns that might identify good prospects for loans
and flag problem borrowers. Increasingly, they comb
external data not collected by a credit reporting
agency—for example, trends in a neighborhood’s
housing values or in local merchants’ sales patterns—
to provide insights into where sales opportunities could
be found or where higher concentrations of problem
customers are located.
The same approaches can help identify shifts in
consumer tastes, such as for apparel and furniture.
And, by analyzing gray data related to costs of
resources and changes in transportation schedules,
these approaches can help anticipate stresses on
suppliers and help identify where additional suppliers
might be found.
All of these activities require human intelligence,
experience, and insight to make sense of the data,
figure out the questions to ask, decide what
information should be correlated, and generally
conduct the analysis.
Why the time is ripe for Big Data
Conclusion
The human analysis previously described is old hat
for many business analysts, whether they work in
manufacturing, fashion, finance, or real estate. What’s
changing is scale. As noted, many types of information
are now available that never existed or were not
accessible. What could once only be suggested
through surveys, focus groups, and the like can now
be examined directly, because more of the granular
thinking and behaviors are captured. Businesses have
the potential to discover more through larger samples
and more granular details, without relying on people
to recall behaviors and motivations accurately.
PricewaterhouseCoopers believes that Big Data
approaches will become a key value creator for
businesses, letting them tap into the wild, woolly
world of information heretofore out of reach. These
new data management and storage technologies can
also provide economies of scale in more traditional
data analysis. Don’t limit yourself to the efficiencies
of Big Data and miss out on the potential for gaining
insights through its advantages in handling the gray
data prevalent today.
This potential can be realized only if you pull together
and analyze all that data. Right now, there’s simply
too much information for individual analysts to
manage, increasing the chances of missing potential
opportunities or risks. Businesses that augment their
human experts with Big Data technologies could have
significant competitive advantages by heading off
problems sooner, identifying opportunities earlier,
and performing mass customization at a larger scale.
Fortunately, the emerging Big Data tools should let
businesspeople apply individual judgments to vaster
pools of information, enabling low-cost, ad hoc
analysis never before feasible. Plus, as patterns
are discovered, the detection of some can be
automated, letting the human analysts concentrate
on the art of analysis and interpretation that algorithms
can’t accomplish.
Even better, emerging Big Data technologies promise
to extend the reach of analysis beyond the cadre of
researchers and business analysts. Several startups
offer new tools that use familiar data-analysis tools—
similar to those for SQL databases and Excel
spreadsheets—to explore Big Data sources, thus
broadening the ability to explore to a wider set of
knowledge workers.
Finally, Big Data approaches can be used to power
analytics-based services that improve the business
itself, such as in-context recommendations to
customers, more accurate predictions of service
delivery, and more accurate failure predictions
(such as for the manufacturing, energy, medical,
and chemical industries).
Big Data analysis does not replace other systems.
Rather, it supplements the BI systems, data
warehouses, and database systems essential to
financial reporting, sales management, production
management, and compliance systems. The difference
is that these information systems deal with the knowns
that must meet high standards for rigor, accuracy, and
compliance—while the emerging Big Data analytics
tools help you deal with the unknowns that could affect
business strategy or its execution.
As the amount and interconnectedness of data vastly
increases, the value of the Big Data approach will only
grow. If the amount and variety of today’s information is
daunting, think what the world will be like in 5 or 10
years. People will become mobile sensors—collecting,
creating, and transmitting all sorts of information, from
locations to body status to environmental information.
We already see this happening as smartphones
equipped with cameras, microphones, geolocation,
and compasses proliferate. Wearable medical sensors,
small temperature tags for use on packages, and other
radio-equipped sensors are a reality. They’ll be the
Twitter and Facebook feeds of tomorrow, adding vast
quantities of new information that could provide
context on behavior and environment never before
possible—and a lot of “noise” certain to mask
what’s important.
Insight-oriented analytics in this sea of information—
where interactions cause untold ripples and eddies in
the flow and delivery of business value—will become
a critical competitive requirement. Big Data technology
is the likeliest path to gaining such insights.
13
The data scalability challenge
John Parkinson of TransUnion describes the
data handling issues more companies will face
in three to five years.
Interview conducted by Vinod Baya and Alan Morrison
John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood
Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines
TransUnion’s considerable requirements for less-structured data analysis, shedding
light on the many data-related technology challenges TransUnion faces today—challenges
he says that more companies will face in the near future.
PwC: In your role at TransUnion, you’ve
evaluated many large-scale data processing
technologies. What do you think of Hadoop
and MapReduce?
JP: MapReduce is a very computationally attractive
answer for a certain class of problem. If you have that
class of problem, then MapReduce is something you
should look at. The challenge today, however, is that the
number of people who really get the formalism behind
MapReduce is a lot smaller than the group of people
trying to understand what to do with it. It really hasn’t
evolved yet to the point where your average enterprise
technologist can easily make productive use of it.
PwC: What class of problem would that be?
JP: MapReduce works best in situations where you
want to do high-volume, accurate but approximate
matching and categorization in very large, lowstructured data sets. At TransUnion, we spend a lot of
our time trawling through tens or hundreds of billions
14
of rows of data looking for things that match a pattern
approximately. MapReduce is a more efficient filter for
some of the pattern-matching algorithms that we have
tried to use. At least in its theoretical formulation, it’s
very amenable to highly parallelized execution, which
many of the other filtering algorithms we’ve used aren’t.
The open-source stack is attractive for experimenting,
but the problem we find is that Hadoop isn’t what
Google runs in production—it’s an attempt by a bunch
of pretty smart guys to reproduce what Google runs in
production. They’ve done a good job, but it’s like a lot
of open-source software—80 percent done. The
20 percent that isn’t done—those are the hard parts.
From an experimentation point of view, we have had a
lot of success in proving that the computing formalism
behind MapReduce works, but the software that we
can acquire today is very fragile. It’s difficult to manage.
It has some bugs in it, and it doesn’t behave very well
in an enterprise environment. It also has some
interesting limitations when you try to push the
scale and the performance.
We found a number of representational problems
when we used the HDFS/Hadoop/HBase stack to
do something that, according to the documentation
available, should have worked. However, in practice,
limits in the code broke the stack well before what we
thought was a good theoretical limit.
Now, the good news of course is that you get source
code. But that’s also the bad news. You need to get the
source code, and that’s not something that we want to
do as part of routine production. I have a bunch of
smart engineers, but I don’t want them spending their
day being the technology support environment for what
should be a product in our architecture. Yes, there’s a
pony there, but it’s going to be awhile before it stabilizes
to the point that I want to bet revenue on it.
PwC: Data warehousing appliance prices have
dropped pretty dramatically over the past couple
of years. When it comes to data that’s not
necessarily on the critical path, how does an
enterprise make sure that it is not spending
more than it has to?
JP: We are probably not a good representational
example of that because our business is analyzing the
data. There is almost no price we won’t pay to get a
better answer faster, because we can price that into
the products we produce. The challenge we face is
that the tools don’t always work properly at the edge
of the envelope. This is a problem for hardware as
well as software. A lot of the vendors stop testing
their applications at about 80 percent or 85 percent
of their theoretical capability. We routinely run them at
110 percent of their theoretical capability, and they
break. I don’t mind making tactical justifications for
technologies that I expect to replace quickly. I do that
all the time. But having done that, I want the damn
thing to work. Too often, we’ve discovered that it
doesn’t work.
PwC: Are you forced to use technologies that
have matured because of a wariness of things
on the absolute edge?
JP: My dilemma is that things that are known to work
usually don’t scale to what we need—for speed or full
capacity. I must spend some time, energy, and dollars
betting on things that aren’t mature yet, but that can be
sufficiently generalized architecturally. If the one I pick
doesn’t work, or goes away, I can fit something else into
its place relatively easily. That’s why we like appliances.
As long as they are well behaved at the network layer
and have a relatively generalized or standards-based
business semantic interface, it doesn’t matter if I have
to unplug one in 18 months or two years because
something better came along. I can’t do that for
everything, but I can usually afford to do it in the areas
where I have no established commercial alternative.
“I have a bunch of smart engineers, but I don’t want them spending their
day being the technology support environment for what should be a
product in our architecture.”
The data scalability challenge 15
PwC: What are you using in place of something
like Hadoop?
PwC: Of the three kinds of data, which is the
most challenging?
JP: Essentially, we use brute force. We use Ab Initio,
which is a very smart brute-force parallelization scheme.
I depend on certain capabilities in Ab Initio to parallelize
the ETL [extract, transform, and load] in such a way that
I can throw more cores at the problem.
JP: We have two kinds of challenges. The first is driven
purely by the scale at which we operate. We add
roughly half a terabyte of data per month to the credit
file. Everything we do has challenges related to scale,
updates, speed, or database performance. The vendors
both love us and hate us. But we are where the industry
is going—where everybody is going to be in two to five
years. We are a good leading indicator, but we break
their stuff all the time. A second challenge is the
unstructured part of the data, which is increasing.
PwC: Much of the data you see is transactional. Is
it all structured data, or are you also mining text?
JP: We get essentially three kinds of data. We get
accounts receivable data from credit loan issuers. That’s
the record of what people actually spend. We get public
record data, such as bankruptcy records, court records,
and liens, which are semi-structured text. And we get
other data, which is whatever shows up, and it’s
generally hooked together around a well-understood
set of identifiers. But the cost of this data is essentially
free—we don’t pay for it. It’s also very noisy. So we
have to spend computational time figuring out whether
the data we have is right, because we must find a place
to put it in the working data sets that we build.
At TransUnion, we suck in 100 million updates a day
for the credit files. We update a big data warehouse
that contains all the credit and related data. And then
every day we generate somewhere between 1 and 20
operational data stores, which is what we actually run
the business on. Our products are joined between what
we call indicative data, the information that identifies
you as an individual; structured data, which is derived
from transactional records; and unstructured data that
is attached to the indicative. We build those products
on the fly because the data may change every day,
sometimes several times a day.
One challenge is how to accurately find the right place
to put the record. For example, we get a Joe Smith at
13 Main Street and a Joe Smith at 31 Main Street.
Are those two different Joe Smiths, or is that a typing
error? We have to figure that out 100 million times a
day using a bunch of custom pattern-matching and
probabilistic algorithms.
16
PwC: It’s more of a challenge to deal with the
unstructured stuff because it comes in various
formats and from various sources, correct?
JP: Yes. We have 83,000 data sources. Not everyone
provides us with data every day. It comes in about
4,000 formats, despite our data interchange standards.
And, to be able to process it fast enough, we must
convert all data into a single interchange format that is
the representation of what we use internally. Complex
computer science problems are associated with all
of that.
PwC: Are these the kinds of data problems that
businesses in other industries will face in three
to five years?
JP: Yes, I believe so.
PwC: What are some of the other problems you
think will become more widespread?
JP: Here are some simple practical examples. We have
8.5 petabytes of data in the total managed environment.
Once you go seriously above 100 terabytes, you must
replace the storage fabric every four or five years.
Moving 100 terabytes of data becomes a huge material
issue and takes a long time. You do get some help from
improved interconnect speed, but the arrays go as fast
as they go for reads and writes and you can’t go faster
than that. And businesses down the food chain are not
accustomed to thinking about refresh cycles that take
months to complete. Now, a refresh cycle of PCs might
take months to complete, but any one piece of it takes
only a couple of hours. When I move data from one
array to another, I’m not done until I’m done.
Additionally, I have some bugs and new
vulnerabilities to deal with.
Today, we don’t have a backup problem at TransUnion
because we do incremental forever backup. However,
we do have a restore problem. To restore a material
amount of data, which we very occasionally need to do,
takes days in some instances because the physics of
the technology we use won’t go faster than that. The
average IT department doesn’t worry about these
problems. But take the amount of data an average IT
department has under management, multiply it by a
single decimal order of magnitude, and it starts to
become a material issue.
We would like to see computationally more-efficient
compression algorithms, because my two big cost
pools are Store It and Move It. For now, I don’t have
a computational problem, but if I can’t shift the trend
line on Store It and Move It, I will have a computational
problem within a few years. To perform the
computations in useful time, I must parallelize how I
compute. Above a certain point, the parallelization
breaks because I can’t move the data further.
The data scalability challenge PwC: Cloudera [a vendor offering a Hadoop
distribution] would say bring the computation to
the data.
JP: That works only for certain kinds of data. We already
do all of that large-scale computation on a file system
basis, not on a database basis. And we spend compute
cycles to compress the data so there are fewer bits to
move, then decompress the data for computation, and
recompress it so we have fewer bits to store.
What we have discovered—because I run the fourth
largest commercial GPFS [general parallel file system,
a distributed computing file system developed by IBM]
cluster in the world—is that once you go beyond a
certain size, the parallelization management tools break.
That’s why I keep telling people that Hadoop is not what
Google runs in production. Maybe the Google guys
have solved this, but if they have, they aren’t telling
me how. n
“We would like to see
computationally more-efficient
compression algorithms,
because my two big cost
pools are Store It and Move It.”
17
Disney’s Bud Albers, Scott Thompson,
and Matt Estes outline an agile
approach that leverages open-source
and cloud technologies.
Interview conducted by Galen Gruman
and Alan Morrison
Bud Albers joined what is now the Disney Technology Shared
Services Group two years ago as executive vice president and CTO. His management
team includes Scott Thompson, vice president of architecture, and Matt Estes, principal
data architect. The Technology Shared Services Group, located in Seattle, has a heritage
dating back to the late 1990s, when Disney acquired Starwave and Infoseek.
The group supports all the Disney businesses ($38 billion in annual revenue), managing
the company’s portfolio of Web properties. These include properties for the studio, store,
and park; ESPN; ABC; and a number of local television stations in major cities.
In this interview, Albers, Thompson, and Estes discuss how they’re expanding Disney’s
Web data analysis footprint without incurring additional cost by implementing a Hadoop
cluster. Albers and team freed up budget for this cluster by virtualizing servers and
eliminating other redundancies.
PwC: Disney is such a diverse company, and yet
there clearly is lots of potential for synergies and
cross-fertilization. How do you approach these
opportunities from a data perspective?
BA: We try and understand the best way to work
with and to provide services to the consumer in the
long term. We have some businesses that are very
data intensive, and then we have some that are less
so because of their consumer audience. One of the
challenges always is how to serve both kinds of
businesses and do so in ways that make sense.
The sell-to relationships extend from the studio out
to the distribution groups and the theater chains.
If you’re selling to millions, you’re trying to understand
the different audiences and how they connect.
18
One of the things I’ve been telling my folks from a data
perspective is that you don’t send terabytes one way to
be mated with a spreadsheet on the other side, right?
We’re thinking through those kinds of pieces and trying
to figure out how we move down a path. The net is that
working with all these businesses gives us a diverse set
of requirements, as you might imagine. We’re trying to
stay ahead of where all the businesses are.
In that respect, the questions I’m asking are, how do
we get more agile, and how do we do it in a way that
handles all the data we have? We must consider all of
the new form factors being developed, all of which will
generate lots of data. A big question is, how do we
handle this data in a way that makes cost sense for the
business and provides us an increased level of agility?
We hope to do in other areas what we’ve done with
content distribution networks [CDNs]. We’ve had a
tremendous amount of success with the CDN
marketplace by standardizing, by staying in the middle
of the road and not going to Akamai proprietary
extensions, and by creating a dynamic marketplace.
If we get a new episode of Lost, we can start
streaming it, and I can be streaming 80 percent on
Akamai and 20 percent on Level 3. Then we can
decide we’re going to turn it back, and I’m going
to give 80 percent to Limelight and 20 percent to
Level 3. We can do that dynamically.
PwC: What are the other main strengths of the
Technology Shared Services Group at Disney?
BA: When I came here a couple of years ago, we had
some very good core central services. If you look at the
true definition of a cloud, we had the very early makings
of one—shared central services around registration,
for example. On Disney, on ABC, or on ESPN, if you
have an ID, it works on all the Disney properties. If you
have an ESPN ID, you can sign in to KGO in San
Francisco, and it will work. It’s all a shared registration
system. The advertising system we built is shared. The
marketing systems we built are shared—all the analytics
collection, all those things are centralized. Those things
that are common are shared among all the sites.
Those things that are brand specific are built by the
brands, and the user interface is controlled by the
brands, so each of the various divisions has a head
of engineering on the Web site who reports to me.
Our CIO worries about it from the firewall back;
I worry about it from the firewall to the living room
and the mobile device. That’s the way we split up
the world, if that makes sense.
PwC: How do you link the data requirements of
the central core with those that are unique to the
various parts of the business?
BA: It’s more art than science. The business units
must generate revenue, and we must provide the core
services. How do you strike that balance? Ownership
is a lot more negotiated on some things today. We
typically pull down most of the analytics and add things
in, and it’s a constant struggle to answer the question,
“Do we have everything?” We’re headed toward this
notion of one data element at a time, aggregate, and
queue up the aggregate. It can get a little bit crazy
because you wind up needing to pull the data in and
run it through that whole food chain, and it may or
may not have lasting value.
It may have only a temporal level of importance, and so
we’re trying to figure out how to better handle that. An
awful lot of what we do in the data collection is pull it in,
lay it out so it can be reported on, and/or push it back
into the businesses, because the Web is evolving
rapidly from a standalone thing to an integral part
of how you do business.
“It’s more art than science. The business units must generate revenue,
and we must provide the core services. How do you strike that balance?
Ownership is a lot more negotiated on some things today.”
—Bud Albers
19
PwC: Hadoop seems to suggest a feasible way
to analyze data that has only temporal
importance. How did you get to the point where
you could try something like a Hadoop cluster?
BA: Guys like me never get called when it’s all pretty
and shiny. The Disney unit I joined obviously has many
strengths, but when I was brought on, there was a cost
growth situation. The volume of the aggregate activity
growth was 17 percent. Our server growth at the time
was 30 percent. So we were filling up data centers, but
we were filling them with CPUs that weren’t being used.
My question was, how can you go to the CFO and ask
for a lot of money to fill a data center with capital assets
that you’re going to use only 5 percent of?
CPU utilization isn’t the only measure, but it’s the most
prominent one. To study and understand what was
happening, we put monitors and measures on our
servers and reported peak CPU utilization on fiveminute intervals across our server farm. We found that
on roughly 80 percent of our servers, we never got
above 10 percent utilization in a monthly period.
Our first step to address that problem was virtualization.
At this point, about 49 percent of our data center is
virtual. Our virtualization effort had a sizable impact on
cost. Dollars fell out because we quit building data
centers and doing all kinds of redundant shuffling. We
didn’t have to lay off people. We changed some of our
processes, and we were able to shift our growth curve
from plus 27 to minus 3 on the shared service.
We call this our D-Cloud effort. Another step in this
effort was moving to a REST [REpresentational State
Transfer] and JSON [JavaScript Object Notation] data
exchange standard, because we knew we had to hit all
these different devices and create some common APIs
[application programming interfaces] in the framework.
One of the very first things we put in place was a central
logging service for all the events. These event logs can
be streamed into one very large data set. We can then
use the Hadoop and MapReduce paradigm to go after
that data.
20
PwC: How does the central logging service fit
into your overall strategy?
ST: As we looked at it, we said, it’s not just about
virtualization. To be able to burst and do these other
things, you need to build a bunch of core services.
The initiative we’re working on now is to build some of
those core services around managing configuration.
This project takes the foundation we laid with
virtualization and a REST and JSON data exchange
standard, and adds those core services that enable us
to respond to the marketplace as it develops. Piping
that data back to a central repository helps you to
analyze it, understand what’s going on, and make
better decisions on the basis of what you learned.
PwC: How do you evolve so that the data
strategy is really served well, so that it’s more
of a data-driven approach in some ways?
ME: On one side, you have a very transactional OLTP
[online transactional processing] kind of world, RDBMSs
[relational database management systems], and major
vendors that we’re using there. On the other side of it,
you have traditional analytical warehousing. And where
we’ve slotted this [Hadoop-style data] is in the middle
with the other operational data. Some of it is derived
from transactional data, and some has been crafted
out of analytical data. There’s a freedom that’s derived
from blending these two kinds of data.
Our centralized logging service is an example.
As we look at continuing to drive down costs to
drive up efficiency, we can begin to log a large
amount of this data at a price point that we have
not been able to achieve by scaling up RDBMSs
or using warehousing appliances.
Then the key will be putting an expert system in place.
That will give us the ability to really understand what’s
going on in the actual operational environment.
We’re starting to move again toward lower utilization
trajectories. We need to scale the infrastructure back
and get that utilization level up to the threshold.
PwC: This kind of information doesn’t go in a
cube. Not that data cubes are going away,
but cubes are fairly well known now. The value
you can create is exactly what you said,
understanding the thinking behind it and the
exploratory steps.
ST: We think storing the unstructured data in its raw
format is what’s coming. In a Hadoop environment,
instead of bringing the data back to your warehouse,
you figure out what question you want to answer. Then
you MapReduce the input, and you may send that off
to a data cube and a place that someone can dig
around in, but you keep the data in its raw format
and pull out only what you need.
BA: The wonderful thing about where we’re headed
right now is that data analysis used to be this giant,
massive bet that you had to place up front, right?
No longer. Now, I pull Hadoop off of the Internet,
first making sure that we’re compliant from a legal
perspective with licensing and so forth. After that’s
taken care of, you begin to prototype. You begin to
work with it against common hardware. You begin
to work with it against stuff you otherwise might
throw out. Rather than, I’m going to go spend how
much for Teradata?
We’re using the basic premise of the cloud, and we’re
using those techniques of standardizing the interface
to virtualize and drive cost out. I’m taking that cost
savings and returning some of it to the business, but
then reinvesting some in new capabilities while the
cost curve is stabilizing.
ME: Refining some of this reinvestment in new
capabilities doesn’t have to be put in the category of
traditional “$5 million projects” companies used to
think about. You can make significant improvements
with reinvestments of $200,000 or even $50,000.
BA: It’s then a matter of how you’re redeploying an
investment in resources that you’ve already made as
a corporation. It’s a matter of now prioritizing your
work and not changing the bottom-line trajectory in a
negative fashion with a bet that may not pay off. I can
try it, and I don’t have to get great big governancebased permission to do it, because it’s not a bet of half
the staff and all of this stuff. It’s, OK, let’s get something
on the ground, let’s work with the business unit, let’s
pilot it, let’s go somewhere where we know we have a
need, let’s validate it against this need, and let’s make
sure that it’s working. It’s not something that must go
through an RFP [request for proposal] and standard
procurement. I can move very fast. n
“We think storing the unstructured data in its raw format is what’s
coming. In a Hadoop environment, instead of bringing the data back to
your warehouse, you figure out what question you want to answer.”
—Scott Thompson
21
How companies are using open-source cluster-computing techniques
to analyze their data.
By Alan Morrison
22
As recently as two years ago, the International
Supercomputing Conference (ISC) agenda included
nothing about distributed computing for Big Data—
as if projects such as Google Cluster Architecture, a
low-cost, distributed computing design that enables
efficient processing of large volumes of less-structured
data, didn’t exist. In a May 2008 blog, Brough Turner
noted the omission, pointing out that Google had
harnessed as much as 100 petaflops1 of computing
power, compared to a mere 1 petaflop in the new IBM
Roadrunner, a supercomputer profiled in EE Times
that month. “Have the supercomputer folks been
bypassed and don’t even know it?” Turner wondered.2
Turner, co-founder and CTO of Ashtonbrooke.com, a
startup in stealth mode, had been reading Google’s
research papers and remarking on them in his blog for
years. Although the broader business community had
taken little notice, some companies were following in
Google’s wake. Many of them were Web companies
that had data processing scalability challenges similar
to Google’s.
Yahoo, for example, abandoned its own data
architecture and began to adopt one along the lines
pioneered by Google. It moved to Apache Hadoop,
an open-source, Java-based distributed file system
based on Google File System and developed by
the Apache Software Foundation; it also adopted
MapReduce, Google’s parallel programming framework.
Yahoo used these and other open-source tools it helped
develop to crawl and index the Web. After implementing
the architecture, it found other uses for the technology
and has now scaled its Hadoop cluster to 4,000 nodes.
By early 2010, Hadoop, MapReduce, and related
open-source techniques had become the driving forces
behind what O’Reilly Media, The Economist, and others
in the press call Big Data and what vendors call cloud
storage. Big Data refers to data sets that are growing
exponentially and that are too large, too raw, or too
unstructured for analysis by traditional means. Many
who are familiar with these new methods are convinced
that Hadoop clusters will enable cost-effective analysis
of Big Data, and these methods are now spreading
beyond companies that mine the public Web as part
of their business.
By early 2010, Hadoop, MapReduce, and related open-source
techniques had become the driving forces behind what O’Reilly Media,
The Economist, and others in the press call Big Data and what
vendors call cloud storage.
23
“Hadoop will process the data set and output a new data set,
as opposed to changing the data set in place.” —Amr Awadallah
of Cloudera
What are these methods and how do they work? This
article looks at the architecture and tools surrounding
Hadoop clusters with an eye toward what about them
will be useful to mainstream enterprises during the
next three to five years. We focus on their utility for
less-structured data.
Hadoop clusters
Although cluster computing has been around for
decades, commodity clusters are more recent, starting
with UNIX- and Linux-based Beowulf clusters in the
mid-1990s. These banks of inexpensive servers
networked together were pitted against expensive
supercomputers from companies such as Cray and
others—the kind of computers that government
agencies, such as the National Aeronautics and Space
Administration (NASA), bought. It was no accident
that NASA pioneered the development of Beowulf.3
Hadoop extends the value of commodity clusters,
making it possible to assemble a high-end computing
cluster at a low-end price. A central assumption
underlying this architecture is that some nodes are
bound to fail when computing jobs are distributed
across hundreds or thousands of nodes. Therefore,
one key to success is to design the architecture to
anticipate and recover from individual node failures.4
Other goals of the Google Cluster Architecture and
its expression in open-source Hadoop include:
• Price/performance over peak performance—The
emphasis is on optimizing aggregate throughput; for
example, sorting functions to rank the occurrence of
keywords in Web pages. Overall sorting throughput
is high. In each of the past three years, Yahoo’s
Hadoop clusters have won Gray’s terabyte sort
benchmarking test.5
24
• Software tolerance for hardware failures—When a
failure occurs, the system responds by transferring
the processing to another node, a critical capability
for large distributed systems. As Roger Magoulas,
research director for O’Reilly Media, says, “If you are
going to have 40 or 100 machines, you don’t expect
your machines to break. If you are running something
with 1,000 nodes, stuff is going to break all the time.”
• High compute power per query—The ability to
scale up to thousands of nodes implies the ability
to throw more compute power at each query. That
ability, in turn, makes it possible to bring more data
to bear on each problem.
• Modularity and extensibility—Hadoop clusters
scale horizontally with the help of a uniform, highly
modular architecture.
Hadoop isn’t intended for all kinds of workloads,
especially not those with many writes. It works best for
read-intensive workloads. These clusters complement,
rather than replace, high-performance computing (HPC)
and other relational data systems. They don’t work well
with transactional data or records that require frequent
updating. “Hadoop will process the data set and output
a new data set, as opposed to changing the data set in
place,” says Amr Awadallah, vice president of
engineering and CTO of Cloudera, which develops a
version of Hadoop.
A data architecture and a software design that are
frugal with network and disk resources are responsible
for the price/performance ratio of Hadoop clusters.
In Awadallah’s words, “You move your processing to
where your data lives.” Each node has its own
processing and storage, and the data is divided and
processed locally in blocks sized for the purpose.
This concept of localization makes it possible to use
inexpensive serial advanced technology attachment
(SATA) hard disks—the kind used in most PCs and
servers—and Gigabit Ethernet for most network
interconnections. (See Figure 1.)
Client
Switch
1000Mbps
Switch
100Mbps
Switch
100Mbps
Typical node setup
2 quad-core Intel Nehalem
24GB of RAM
Task tracker/
DataNode
JobTracker
Task tracker/
DataNode
NameNode
Task tracker/
DataNode
Task tracker/
DataNode
Effective file space per node: 20TB
Task tracker/
DataNode
Task tracker/
DataNode
Claimed benefits
Task tracker/
DataNode
Task tracker/
DataNode
Task tracker/
DataNode
Task tracker/
DataNode
Rack
12 1TB SATA disks (non-RAID)
1 Gigabit Ethernet card
Cost per node: $5,000
Rack
Linear scaling at $250 per user TB
(versus $5,000–$100,000 for alternatives)
Compute placed near the data and
fewer writes limit networking
and storage costs
Modularity and extensibility
Figure 1: Hadoop cluster layout and characteristics
Source: IBM, 2008, and Cloudera, 2010
25
“Amazon supports Hadoop directly through its Elastic MapReduce
application programming interfaces.” —Chris Wensel of Concurrent
The result is less-expensive large-scale distributed
computing and parallel processing, which make
possible an analysis that is different from what most
enterprises have previously attempted. As author Tom
White points out, “The ability to run an ad hoc query
against your whole data set and get the results in a
reasonable time is transformative.”6
The cost of this capability is low enough that companies
can fund a Hadoop cluster from existing IT budgets.
When it decided to try Hadoop, Disney’s Technology
Shared Services Group took advantage of the increased
server utilization it had already achieved from
virtualization. As of March 2010, with nearly 50 percent
of its servers virtualized, Disney had 30 percent server
image growth annually but 30 percent less growth in
physical servers. It was then able to set up a multiterabyte cluster with Hadoop and other free opensource tools, using servers it had planned to retire. The
group estimates it spent less than $500,000 on the
entire project. (See the article, “Tapping into the power
of Big Data,” on page 04.)
These clusters are also transformative because cloud
providers can offer them on demand. Instead of using
their own infrastructures, companies can subscribe to
a service such as Amazon’s or Cloudera’s distribution
on the Amazon Elastic Compute Cloud (EC2) platform.
The EC2 platform was crucial in a well-known use of
cloud computing on a Big Data project that also
depended on Hadoop and other open-source tools.
In 2007, The New York Times needed to quickly
assemble the PDFs of 11 million articles from
4 terabytes of scanned images. Amazon’s EC2 service
completed the job in 24 hours after setup, a feat
that received widespread attention in blogs and the
trade press.
26
Mostly overlooked in all that attention was the use of
the Hadoop Distributed File System (HDFS) and the
MapReduce framework. Using these open-source tools,
after studying how-to blog posts from others, Times
senior software architect Derek Gottfrid developed and
ran code in parallel across multiple Amazon machines.7
“Amazon supports Hadoop directly through its Elastic
MapReduce application programming interfaces [APIs],”
says Chris Wensel, founder of Concurrent, which
developed Cascading. (See the discussion of
Cascading later in this article.) “I regularly work with
customers to boot up 200-node clusters and process
3 terabytes of data in five or six hours, and then shut
the whole thing down. That’s extraordinarily powerful.”
The Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) and the
MapReduce parallel programming framework are at
the core of Apache Hadoop. Comparing HDFS and
MapReduce to Linux, Awadallah says that together
they’re a “data operating system.” This description may
be overstated, but there are similarities to any operating
system. Operating systems schedule tasks, allocate
resources, and manage files and data flows to fulfill the
tasks. HDFS does a distributed computing version of
this. “It takes care of linking all the nodes together to
look like one big file and job scheduling system for the
applications running on top of it,” Awadallah says.
HDFS, like all Hadoop tools, is Java based. An HDFS
contains two kinds of nodes:
• A single NameNode that logs and maintains the
necessary metadata in memory for distributed jobs
• Multiple DataNodes that create, manage, and
process the 64MB blocks that contain pieces of
Hadoop jobs, according to the instructions from
the NameNode
HDFS uses multi-gigabyte file sizes to reduce the
management complexity of lots of files in large data
volumes. It typically writes each copy of the data once,
adding to files sequentially. This approach simplifies
the task of synchronizing data and reduces disk and
bandwidth usage.
HDFS does not perform tasks such as changing
specific numbers in a list or other changes on parts
of a database. This limitation leads some to assume
that HDFS is not suitable for structured data. “HDFS
was never designed for structured data and therefore
it’s not optimal to perform queries on structured data,”
says Daniel Abadi, assistant professor of computer
science at Yale University. Abadi and others at Yale
have done performance testing on the subject, and they
have created a relational database alternative to HDFS
called HadoopDB to address the performance issues
they identified.8
Equally important are fault tolerance within the same
disk and bandwidth usage limits. To accomplish fault
tolerance, HDFS creates three copies of each data
block, typically storing two copies in the same rack.
The system goes to another rack only if it needs the
third copy. Figure 2 shows a simplified depiction of
HDFS and its data block copying method.
Some developers are structuring data in ways that are
suitable for HDFS; they’re just doing it differently from
the way relational data would be structured. Nathan
Marz, a lead engineer at BackType, a company that
offers a search engine for social media buzz, uses
schemas to ensure consistency and avoid data
corruption. “A lot of people think that Hadoop is
meant for unstructured data, like log files,” Marz says.
“While Hadoop is great for log files, it’s also fantastic
for strongly typed, structured data.” For this purpose,
Marz uses Thrift, which was developed by Facebook
for data translation and serialization purposes.9 (See
the discussion of Thrift later in this article.) Figure 3
illustrates a typical Hadoop data processing flow that
includes Thrift and MapReduce.
Client
NameNode
(metadata)
Files
File A
File A
Blocks
1, 2, 4
3, 5
DataNode
DataNode
DataNode
DataNode
1
5
4
5
2
4
2
3
3
1
2
5
Figure 2: The Hadoop Distributed File System, or HDFS
Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008
Input
data
Input
applications
Less-structured
information
such as:
log files
messages
images
Cascading
Thrift
Zookeeper
Pig
Output
applications
Core Hadoop data processing
Mashups
RDBMS apps
BI systems
1
Jobs
1
M
1
2
M
2
2
3
M
R
Results
3
3
M
Map
R
Reduce
64MB blocks
Figure 3: Hadoop ecosystem overview
Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010
27
MapReduce
MapReduce is the base programming framework for
Hadoop. It often acts as a bridge between HDFS and
tools that are more accessible to most programmers.
According to those at Google who developed the tool,
“it hides the details of parallelization” and the other
nuts and bolts of HDFS.10
MapReduce is a layer of abstraction, a way of managing
a sea of details by creating a layer that captures and
summarizes their essence. That doesn’t mean it is easy
to use. Many developers choose to work with another
tool, yet another layer of abstraction on top of it. “I
avoid using MapReduce directly at all cost,” Marz says.
“I actually do almost all my MapReduce work with a
library called Cascading.”
The terms “map” and “reduce” refer to steps the
tool takes to distribute, or map, the input for parallel
processing, and then reduce, or aggregate, the
processed data into output files. (See Figure 4.)
MapReduce works with key-value pairs. Frequently
with Web data, the keys consist of URLs and the
values consist of Web page content, such as
Hypertext Markup Language (HTML).
MapReduce’s main value is as a platform with a set
of APIs. Before MapReduce, fewer programmers could
take advantage of distributed computing. Now that
user-accessible tools have been designed, simpler
programming is possible on massively parallel systems
and less adaptation of the programs is required.
The following sections examine some of these tools.
Data
store 1
Data
store n
Input key-value pairs
Input key-value pairs
Map
key 1 values
Map
key 2 values
Barrier ...
key 1 values
key 3 values
key 2 values
key 3 values
Aggregates intermediate values by output key ... Barrier
key 1 intermediate values
Reduce
Reduce
Reduce
final key 1 values
final key 2 values
final key 3 values
Figure 4: MapReduce phases
Source: Google, 2004, and Cloudera, 2009
28
“You can code in whatever JVM-based language you want, and then
shove that into the cluster.” —Chris Wensel of Concurrent
Cascading
Wensel, who created Cascading, calls it an alternative
API to MapReduce, a single library of operations that
developers can tap. It’s another layer of abstraction
that helps bring what programmers ordinarily do in
non-distributed environments to distributed computing.
With it, he says, “you can code in whatever JVM-based
[Java Virtual Machine] language you want, and then
shove that into the cluster.”
Wensel wanted to obviate the need for “thinking in
MapReduce.” When using Cascading, developers don’t
think in key-value pair terms—they think in terms of
fields and lists of values called “tuples.” A Cascading
tuple is simpler than a database record but acts like
one. Each tuple flows through “pipe” assemblies, which
are comparable to Java classes. The data flow begins
at the source, an input file, and ends with a sink, an
output directory. (See Figure 5.)
Map
Reduce
[f1, f2, ...]
P
Map
[f1, f2, ...]
P
[f1, f2, ...]
So
Assembly
Flow
A
A
A
A
A
A
A
A
MR
MR
MR
MR
Cluster
Job
Job
Reduce
[f1, f2, ...]
P
Rather than approach map and reduce phases large-file
by large-file, developers assemble flows of operations
using functions, filters, aggregators, and buffers.
Those flows make up the pipe assemblies, which, in
Marz’s terms, “compile to MapReduce.” In this way,
Cascading smoothes the bumpy MapReduce terrain so
more developers—including those who work mainly in
Client
scripting
languages—can build flows. (See Figure 6.)
[f1, f2, ...]
P
A
[f1, f2, ...]
P
P
MR
[f1, f2, ...]
Pipe assembly
Hadoop MR (translation to MapReduce)
MapReduce jobs
Si
Figure 6: Cascading assembly and flow
[f1, f2, ...]
So
Si
P
Tuples with field names
Source
Sink
Pipe
Source: Concurrent, 2010
Figure 5: A Cascading assembly
Source: Concurrent, 2010
29
Some useful tools for MapReduce-style
analytics programming
Open-source tools that work via MapReduce on
Hadoop clusters are proliferating. Users and developers
don’t seem concerned that Google received a patent for
MapReduce in January 2010. In fact, Google, IBM, and
others have encouraged the development and use of
open-source versions of these tools at various research
universities.11 A few of the more prominent tools
relevant to analytics, and used by developers we’ve
interviewed, are listed in the sections that follow.
With LISP, Watson says, he can load the data once and
test multiple times. In C++, he would need to use a
relational database and reload each time for a program
test. Using LISP makes it possible to create and test
small bits of code in an iterative fashion, a major reason
for the productivity gains.�
This iterative, LISP-like program-programmer
interaction with Clojure leads to what Hickey calls
“dynamic development.” Any code entered in the
console interface, he points out, is automatically
compiled on the fly.�
Clojure
Thrift
Clojure creator Rich Hickey wanted to combine aspects
of C or C#, LISP (for list processing, a language
associated with artificial intelligence that’s rich in
mathematical functions), and Java. The letters C, L, and
J led him to name the language, which is pronounced
“closure.” Clojure combines a LISP library with Java
libraries. Clojure’s mathematical and natural language
processing (NLP) capabilities and the fact that it is JVM
based make it useful for statistical analysis on Hadoop
clusters. FlightCaster, a commercial-airline-delayprediction service, uses Clojure on top of Cascading,
on top of MapReduce and Hadoop, for “getting the
right view into unstructured data from heterogeneous
sources,” says Bradford Cross, FlightCaster co-founder.�
Thrift, initially created at Facebook in 2007 and then
released to open source, helps developers create
services that communicate across languages, including
C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby.
With Thrift, according to Facebook, users can “define
all the necessary data structures and interfaces for a
complex service in a single short file.”�
LISP has attributes that lend themselves to NLP, making
Clojure especially useful in NLP applications. Mark
Watson, an artificial intelligence consultant and author,
says most LISP programming he’s done is for NLP. He
considers LISP to be four times as productive for
programming as C++ and twice as productive as Java.
His NLP code “uses a huge amount of memory-resident
data,” such as lists of proper nouns, text categories,
common last names, and nationalities.
“Getting the right view into
unstructured data from
heterogeneous sources can
be quite tricky.” —Bradford
Cross of FlightCaster
30
A more important aspect of Thrift, according to
BackType’s Marz, is its ability to create strongly typed
data and flexible schemas. Countering the emphasis of
the so-called NoSQL community on schema-less data,
Marz asserts there are effective ways to lightly structure
the data in Hadoop-style analysis.
Marz uses Thrift’s serialization features, which turn
objects�into a sequence of bits that can be stored as
files, to create schemas between types (for instance,
differentiating between text strings and long, 64-bit
integers) and schemas between relationships (for
instance, linking Twitter accounts that share a
common interest). Structuring the data in this way
helps BackType avoid inconsistencies in the data
or the need to manually filter for some attributes.
BackType can use required and optional fields to
structure the Twitter messages it crawls and analyzes.
The required fields can help enforce data type. The
optional fields, meanwhile, allow changes to the
schema as well as the use of old objects that were
created using the old schema.
Marz’s use of Thrift to model social graphs like the one
in Figure 7 demonstrates the flexibility of the schema
for Hadoop-style computing. Thrift essentially enables
modularity in the social graph described in the schema.
For example, to select a single age for each person,
BackType can take into account all the raw age data.
It can do this by a computation on the entire data set
or a selective computation on only the people in the
data set who have new data.
Bob
Gender male
Age
Charlie
Gender female
Gender male
25
Age
Apache
Thrift
Non-relational data stores have become much more
numerous since the Apache Hadoop project began in
2007. Many are open source. Developers of these data
stores have optimized each for a different kind of data.
When contrasted with relational databases, these data
stores lack many design features that can be essential
for enterprise transactional data. However, they are
often well tailored to specific, intended purposes,
and they offer the added benefit of simplicity. Primary
non-relational data store types include the following:
• Multidimensional map store—Each record
maps a row name, a column name, and a time
stamp to a value. Map stores have their heritage
in Google’s Bigtable.
39
Alice
Age
Open-source, non-relational data stores
22
Language: C++
Figure 7: An example of a social graph modeled using Thrift schema
Source: Nathan Marz, 2010
BackType doesn’t just work with raw data. It runs a
series of jobs that constantly normalize and analyze
new data coming in, and then other jobs that write the
analyzed data to a scalable random-access database
such as HBase or Cassandra.12
• Key-value store—Each record consists of a key,
or unique identifier, mapped to one or more values.
• Graph store—Each record consists of elements that
together form a graph. Graphs depict relationships.
For example, social graphs describe relationships
between people. Other graphs describe relationships
between objects, between links, or both.
• Document store—Each record consists of a
document. Extensible Markup Language (XML)
databases, for example, store XML documents.
Because of their simplicity, map and key-value stores
can have scalability advantages over most types of
relational databases. (HadoopDB, a hybrid approach
developed at Yale University, is designed to overcome
the scalability problems associated with relational
databases.) Table 1 provides a few examples of the
open-source, non-relational data stores that
are available.
Map
Key-value
Document
Graph
HBase
Tokyo Cabinet/Tyrant
MongoDB
Resource Description Framework (RDF)
Hypertable
Project Voldemort
CouchDB
Neo4j
Cassandra
Redis
Xindice
InfoGrid
Table 1: Example open-source, non-relational data stores
Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010
31
“We established that Hadoop does horizontally scale. This is what’s really
exciting, because I’m an RDBMS guy, right? I’ve done that for years, and
you don’t get that kind of constant scalability no matter what you do.”
—Scott Thompson of Disney
Other related technologies and vendors
A comprehsensive review of the various tools created
for the Hadoop ecosystem is beyond the scope of this
article, but a few of the tools merit brief description here
because they’ve been mentioned elsewhere in this issue:
• Pig—A scripting language called Pig Latin, which is
a primary feature of Apache Pig, allows more concise
querying of data sets “directly from the console” than
is possible using MapReduce, according to author
Tom White.
• Hive—Hive is designed as “mainly an ETL [extract,
transform, and load] system” for use at Facebook,
according to Chris Wensel.
• Zookeeper—Zookeeper provides an interface
for creating distributed applications, according
to Apache.
Big Data covers many vendor niches, and some
vendors’ products take advantage of the Hadoop stack
or add to its capabilities. (See the sidebar “Selected Big
Data tool vendors.”)
Conclusion
Interest in and adoption of Hadoop clusters are
growing rapidly. Reasons for Hadoop’s
popularity include:
• Open, dynamic development—The Hadoop/
MapReduce environment offers cost-effective
distributed computing to a community of opensource programmers who’ve grown up on Linux
and Java, and scripting languages such as
Perl and Python. Some are taking advantage of
functional programming language dialects such
as Clojure. The openness and interaction can
lead to faster development cycles.
32
• Cost-effective scalability—Horizontal scaling from
a low-cost base implies a feasible long-term cost
structure for more kinds of data. Scott Thompson,
vice president of infrastructure at the Disney
Technology Shared Services Group, says, “We
established that Hadoop does horizontally scale.
This is what’s really exciting, because I’m an
RDBMS guy, right? I’ve done that for years, and
you don’t get that kind of constant scalability no
matter what you do.”
• Fault tolerance—Associated with scalability is
the assumption that some nodes will fail. Hadoop
and MapReduce are fault tolerant, another reason
commodity hardware can be used.
• Suitability for less-structured data—Perhaps
most importantly, the methods that Google
pioneered, and that Yahoo and others expanded,
focus on what Cloudera’s Awadallah calls
“complex” data. Although developers such as Marz
understand the value of structuring data, most
Hadoop/MapReduce developers don’t have an
RDBMS mentality. They have an NLP mentality, and
they’re focused on techniques optimized for large
amounts of less-structured information, such as the
vast amount of information on the Web.
The methods, cost advantages, and scalability of
Hadoop-style cluster computing clear a path for
enterprises to analyze the Big Data they didn’t have
the means to analyze before. This set of methods is
separate from, yet complements, data warehousing.
Understanding what Hadoop clusters do and how
they do it is fundamental to deciding when and where
enterprises should consider making use of them.
Selected Big Data tool vendors
Amazon
Amazon provides a Hadoop framework on its
Elastic Compute Cloud (EC2) and S3 storage
service it calls Elastic MapReduce.
Appistry
Appistry’s CloudIQ Storage platform offers a
substitute for HDFS, one designed to eliminate
the single point of failure of the NameNode.
Cloudera
Cloudera takes a Red Hat approach to Hadoop,
offering its own distribution on EC2/S3 with
management tools, training, support, and
professional services.
Cloudscale
Cloudscale’s first product, Cloudcel, marries an
Excel-based interface to a back end that’s a
massively parallel stream processing engine. The
product is designed to process stored, historical,
or real-time data.
Concurrent
Concurrent developed Cascading, for which
it offers licensing, training, and support.
Drawn to Scale
Drawn to Scale offers an analytical and
transactional database product on Hadoop
and HBase, with occasional consulting.
IBM
IBM introduced a distribution of Hadoop called
BigInsights in May 2010. The company’s jStart
team offers briefings and workshops on Hadoop
pilots. IBM BigSheets acts as an aggregation,
analysis, and visualization point for large amounts
of Web data.
1 FLOPS stands for “floating point operations per second.” Floating point
processors use more bits to store each value, allowing more precision
and ease of programming than fixed point processors. One petaflop is
upwards of one quadrillion floating point operations per second.
2 Brough Turner, “Google Surpasses Supercomputer Community,
Unnoticed?” May 20, 2008, http://blogs.broughturner.com/
communications/2008/05/google-surpasses-supercomputercommunity-unnoticed.html (accessed April 8, 2010).
3 See, for example, Tim Kientzle, “Beowulf: Linux clustering,”
Dr. Dobb’s Journal, November 1, 1998, Factiva Document
dobb000020010916dub100045 (accessed April 9, 2010).
4 Luis Barroso, Jeffrey Dean, and Urs Hoelzle, “Web Search for a
Planet: The Google Cluster Architecture,” Google Research
Publications, http://research.google.com/archive/googlecluster.html
(accessed April 10, 2010).
5 See http://sortbenchmark.org/ and http://developer.yahoo.net/blog/
6 Tom White, Hadoop: The Definitive Guide (Sebastopol, CA: O’Reilly
Media, 2009), 4.
7 See Derek Gottfrid, “Self-service, Prorated Super Computing Fun!”
The New York Times Open Blog, November 1, 2007, http://open.
blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/(accessed June 4, 2010) and Bill Snyder, “Cloud
Computing: Not Just Pie in the Sky,” CIO, March 5, 2008, Factiva
Document CIO0000020080402e4350000 (accessed March 28, 2010).
8 See “HadoopDB” at http://db.cs.yale.edu/hadoopdb/hadoopdb.html
9 Nathan Marz, “Thrift + Graphs = Strong, flexible schemas on
Hadoop,” http://nathanmarz.com/blog/schemas-on-hadoop/
10 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data
Processing on Large Clusters,” Google Research Publications,
December 2004, http://labs.google.com/papers/mapreduce.html
11 See Dean, et al., US Patent No. 7,650,331, January 19, 2010, at http://
www.uspto.gov. For an example of the participation by Google and
IBM in Hadoop’s development, see “Google and IBM Announce
University Initiative to Address Internet-Scale Computing Challenges,”
Google press release, October 8, 2007, http://www.google.com/intl/en/
press/pressrel/20071008_ibm_univ.html (accessed March 28, 2010).
12 See the Apache site at http://apache.org/ for descriptions of many
tools that take advantage of MapReduce and/or HDFS that are not
profiled in this article.
Microsoft
Microsoft Pivot uses the company’s Deep Zoom
technology to provide visual data browsing
capabilities for XML files. Azure Table services is
in some ways comparable to Bigtable or HBase.
(See the interview with Mark Taylor and Ray Velez
of Razorfish on page 46.)
ParaScale
ParaScale offers software for enterprises to
set up their own public or private cloud storage
environments with parallel processing and
large-scale data handling capability.
33
Cloudera’s Amr Awadallah discusses how and why
diverse companies are trying this novel approach.
Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya
Amr Awadallah is vice president of engineering and CTO at Cloudera, a company that
offers products and services around Hadoop, an open-source technology that allows
efficient mining of large, complex data sets. In this interview, Awadallah provides an
overview of Hadoop’s capabilities and how Cloudera customers are using them.
PwC: Were you at Yahoo before coming
to Cloudera?
AA: Yes. I was with Yahoo from mid-2000 until mid2008, starting with the Yahoo Shopping team after
selling my company VivaSmart to Yahoo. Beginning in
2003, my career shifted toward business intelligence
and analytics at consumer-facing properties such as
Yahoo News, Mail, Finance, Messenger, and Search.
I had the daunting task of building a very large data
warehouse infrastructure that covered all these diverse
products and figuring out how to bring them together.
That is when I first experienced Hadoop. Its model of
“mine first, govern later” fits in with the well-governed
infrastructure of a data mart, so it complements these
systems very well. Governance standards are important
for maintaining a common language across the
organization. However, they do inhibit agility, so it’s best
to complement a well-governed data mart with a more
agile complex data processing system like Hadoop.
PwC: How did Yahoo start using Hadoop?
AA: In 2005, Yahoo was faced with a business
challenge. The cost of creating the Web search index
was approaching the revenues being made from the
keyword advertising on the search pages. Yahoo Search
adopted Hadoop as an economically scalable solution,
34
and worked on it in conjunction with the open-source
Apache Hadoop community. Yahoo played a very big
role in the evolution of Hadoop to where it is today.
Soon after the Yahoo Search team started using
Hadoop, other parts of the company began to see
the power and flexibility that this system offers.
Today, Yahoo uses Hadoop for data warehousing,
mail spam detection, news feed processing, and
content/ad targeting.
PwC: What are some of the advantages of
Hadoop when you compare it with RDBMSs
[relational database management systems]?
AA: With Oracle, Teradata, and other RDBMSs, you
must create the table and schema first. You say, this is
what I’m going to be loading in, these are the types of
columns I’m going to load in, and then you load your
data. That process can inhibit how fast you can evolve
your data model and schemas, and it can limit what you
log and track.
With Hadoop, it’s the other way around. You load all of
your data, such as XML [Extensible Markup Language],
tab delimited flat files, Apache log files, JSON
[JavaScript Object Notation], etc. Then in Hive or Pig
[both of which are Hadoop data query tools], you point
your metadata toward the file and parse the data on
“We are not talking about a replacement technology for data warehouses—
let’s be clear on this. No customers are using Hadoop in that fashion.”
the fly when reading it out. This approach lets you
extract the columns that map to the data structure
you’re interested in.
Creating the structure on the read path like this can
have its disadvantages; however, it gives you the agility
and the flexibility to evolve your schema much quicker
without normalizing your data first. In general, relational
systems are not well suited for quickly evolving complex
data types.
Another benefit is retroactive schemas. For example,
an engineer launching a new product feature can add
the logging for it, and that new data will start flowing
directly into Hadoop. Weeks or months later, a data
analyst can update their read schema on how to parse
this new data. Then they will immediately be able to
query the history of this metric since it started flowing
in [as opposed to waiting for the RDBMS schema to
be updated and the ETL (extract, transform, and load)
processes to reload the full history of that metric].
PwC: What about the cost advantages?
AA: The cost basis is 10 to 100 times cheaper than
other solutions. But it’s not just about cost. Relational
databases are really good at what they were designed
for, which is running interactive SQL queries against
well-structured data. We are not talking about a
replacement technology for data warehouses—let’s
be clear on this.
No customers are using Hadoop in that fashion. They
recognize that the nature of data is changing. Where is
the data growing? It’s growing around complex data
types. Is a relational container the best and most
interesting place to ask questions of complex plus
relational data? Probably not, although organizations
still need to use, collect, and present relational data
for questions that are routine and require, in some
cases, a real-time response.
PwC: How have companies benefited
from querying across both structured and
complex data?
AA: When you query against complex data types, such
as Web log files and customer support forums, as well
as against the structured data you have already been
collecting, such as customer records, sales history, and
transactions, you get a much more accurate answer to
the question you’re asking. For example, a large credit
card company we’ve worked with can identify which
transactions are most likely fraudulent and can prioritize
which accounts need to be addressed.
PwC: Are the companies you work with aware
that this is a totally different paradigm?
AA: Yes and no. The main use case we see is in
companies that have a mix of complex data and
structured data that they want to query across. Some
large financial institutions that we talk to have 10, 20,
or even hundreds of Oracle systems—it’s amazing.
They have all of these file servers storing XML files or
log files, and they want to consolidate all these tables
and files onto one platform that can handle both data
types so they can run comprehensive queries. This is
where Hadoop really shines; it allows companies to
run jobs across both data types. n
35
Start by adopting a fresh mind-set, grooming the right talent,
and piloting new tools to ride the next wave of innovation.
By Jimmy Guterman
36
Like pioneers exploring a new territory, a few
enterprises are making discoveries by exploring Big
Data. The terrain is complex and far less structured
than the data CIOs are accustomed to. And it is
growing by exabytes each year. But it is also getting
easier and less expensive to explore and analyze, in
part because software tools built to take advantage
of cloud computing infrastructures are now available.
Our advice to CIOs: You don’t need to rush, but do
begin to acquire the necessary mind-set, skill set,
and tool kit.
These are still the early days. The prime directive
for any CIO is to deliver value to the business through
technology. One way to do that is to integrate new
technologies in moderation, with a focus on the
long-term opportunities they may yield. Leading
CIOs pride themselves on waiting until a technology
has proven value before they adopt it. Fair enough.
However, CIOs who ignore the Big Data trends
described in the first two articles risk being
marginalized in the C-suite. As they did with earlier
technologies, including traditional business
intelligence, business unit executives are ready to
seize the Big Data opportunity and make it their own.
This will be good for their units and their careers,
but it would be better for the organization as a whole
if someone—the CIO is the natural person—drove a
single, central, cross-enterprise Big Data initiative.
With this in mind, PricewaterhouseCoopers
encourages CIOs to take these steps:
• Start to add the discipline and skill set for Big Data
to your organizations; the people for this may or
may not come from existing staff.
• Set up sandboxes (which you can rent or buy) to
experiment with Big Data technologies.
• Understand the open-source nature of the tools
and how to manage risk.
Enterprises have the opportunity to analyze more
kinds of data more cheaply than ever before. It is
also important to remember that Big Data tools did
not originate with vendors that were simply trying to
create new markets. The tools sprung from a real
need among the enterprises that first confronted
the scalability and cost challenges of Big Data—
challenges that are now felt more broadly. These
pioneers also discovered the need for a wider variety
of talent than IT has typically recruited.
Enterprises have the opportunity to analyze more kinds of data
more cheaply than ever before. It is also important to remember
that Big Data tools did not originate with vendors that were simply
trying to create new markets.
37
Big Data lessons from Web companies
Today’s CIO literature is full of lessons you can learn
from companies such as Google. Some of the
comparisons are superficial because most companies
do not have a Web company’s data complexities and
will never attain the original singleness of purpose
that drove Google, for example, to develop Big
Data innovations. But there is no niche where the
development of Big Data tools, techniques, mind-set,
and usage is greater than in companies such as
Google, Yahoo, Facebook, Twitter, and LinkedIn.
And there is plenty that CIOs can learn from these
companies. Every major service these companies
create is built on the idea of extracting more and
more value from more and more data.
For example, the 1-800-GOOG-411 service, which
individuals can call to get telephone numbers and
addresses of local businesses, does not merely take
an ax to the high-margin directory assistance services
run by incumbent carriers (although it does that).
That is just a by-product. More important, the
800-number service has let Google compile what has
been described as the world’s largest database of
spoken language. Google is using that database to
improve the quality of voice recognition in Google
Voice, in its sundry mobile-phone applications, and in
other services under development. Some of the ways
companies such as Google capture data and convert
it into services are listed in Table 1.
Service
Data that Web companies capture
Self-serve advertising
Ad-clicking and -picking behavior
Analytics
Aggregated Web site usage tracking
Social networking
Sundry online
Browser
Limited browser behaviors
E-mail
Words used in e-mails
Search engine
Searches and clicking information
RSS feeds
Detailed reading habits
Extra browser functionality
All browser behavior
View videos
All site behavior
Free directory assistance
Database of spoken words
Table 1: Web portal Big Data strategy
38
“I see inspiration from the Google model ... just having lots of
cheap stuff that you can use to crunch vast quantities of data.”
—Phil Buckle of the UK National Policing Improvement Agency
Many Web companies are finding opportunity in “gray
data.” Gray data is the raw and unvalidated data that
arrives from various sources, in huge quantities, and not
in the most usable form. Yet gray data can deliver value
to the business even if the generators of that content
(for example, people calling directory assistance) are
contributing that data for a reason far different from
improving voice-recognition algorithms. They just want
the right phone number; the data they leave is a gift to
the company providing the service.
The new technologies and services described in the
article, “Building a bridge to the rest of your data,” on
page 22 are making it possible to search for enterprise
value in gray data in agile ways at low cost. Much of
this value is likely to be in the area of knowing your
customers, a sure path for CIOs looking for ways to
contribute to company growth and deepen their
relationships with the rest of the C-suite.
What Web enterprise use of Big Data shows
CIOs, most of all, is that there is a way to think and
manage differently when you conclude that standard
transactional data analysis systems are not and
should not be the only models. New models are
emerging. CIOs who recognize these new models
without throwing away the legacy systems that still
serve them well will see that having more than one
tool set, one skill set, and one set of controls makes
their organizations more sophisticated, more agile,
less expensive to maintain, and more valuable to
the business.
The business case
Besides Google, Yahoo, and other Web-based
enterprises that have complex data sets, there are
stories of brick and mortar organizations that will be
making more use of Big Data. For example, Rollin Ford,
Wal-Mart’s CIO, told The Economist earlier this year,
“Every day I wake up and ask, ‘How can I flow data
better, manage data better, analyze data better?’”
The answer to that question today implies a budget
reallocation, with less-expensive hardware and software
carrying more of the load. “I see inspiration from the
Google model and the notion of moving into
commodity-based computing—just having lots of
cheap stuff that you can use to crunch vast quantities
of data. I think that really contrasts quite heavily with
the historic model of paying lots of money for really
specialist stuff,” says Phil Buckle, CTO of the UK’s
National Policing Improvement Agency, which oversees
law enforcement infrastructure nationwide. That’s a new
mind-set for the CIO, who ordinarily focuses on keeping
the plumbing and the data it carries safe, secure,
in-house, and functional.
Seizing the Big Data initiative would give CIOs in
particular and IT in general more clout in the executive
suite. But are CIOs up to the task? “It would be a
positive if IT could harness unstructured data
effectively,” former Gartner analyst Howard Dresner,
president and founder of Dresner Advisory Services,
observes. “However, they haven’t always done a great
job with structured data, and unstructured is far more
complex and exists predominantly outside the firewall
and beyond their control.”
Tools are not the issue. Many evolving tools, as noted
in the previous article, come from the open-source
community; they can be downloaded and experimented
with for low cost and are certainly up to supporting
any pilot project. More important is the aforementioned
mind-set and a new kind of talent IT will need.
39
“The talent demand isn’t so much for Java developers or statisticians
per se as it is for people who know how to work with denormalized
data.” —Ray Velez of Razorfish
To whom does the future of IT belong?
The ascendance of Big Data means that CIOs need a
more data-centric approach. But what kind of talent
can help a CIO succeed in a more data-centric business
environment, and what specific skills do the CIO’s
teams focused on the area need to develop
and balance?
Hal Varian, a University of California, Berkeley, professor
and Google’s chief economist, says, “The sexy job in
the next 10 years will be statisticians.” He and others,
such as IT and management professor Erik Brynjolfsson
at the Massachusetts Institute of Technology (MIT),
contend this demand will happen because the amount
of data to be analyzed is out of control. Those who
can make sense of the flood will reap the greatest
rewards. They have a point, but the need is not just
for statisticians—it’s for a wide range of analytically
minded people.
Today, larger companies still need staff with expertise
in package implementations and customizations,
systems integration, and business process
reengineering, as well as traditional data management
and business intelligence that’s focused on
transactional data. But there is a growing role for
people with flexible minds to analyze data and suggest
solutions to problems or identify opportunities from
that data.
In Silicon Valley and elsewhere, where businesses such
as Google, Facebook, and Twitter are built on the
rigorous and speedy analysis of data, programming
frameworks such as MapReduce (which works with
Hadoop) and NoSQL (a database approach for nonrelational data stores) are becoming more popular.
40
Chris Wensel, who created Cascading (an alternative
application programming interface [API] to MapReduce)
and straddles the worlds of startups and entrenched
companies, says, “When I talk to CIOs, I tell them:
‘You know those people you have who know about
data. You probably don’t use those people as much
as you should. But once you take advantage of that
expertise and reallocate that talent, you can take
advantage of these new techniques.’”
The increased emphasis on data analysis does not
mean that traditional programmers will be replaced by
quantitative analysts or data warehouse specialists.
“The talent demand isn’t so much for Java developers
or statisticians per se as it is for people who know how
to work with denormalized data,” says Ray Velez, CTO
at Razorfish, an interactive marketing and technology
consulting firm involved in many Big Data initiatives.
“It’s about understanding how to map data into a format
that most people are not familiar with. Most people
understand SQL and the relational format, so the real
skill set evolution doesn’t have quite as much to do with
whether it’s Java or Python or other technologies.”
Velez points to Bill James as a useful case. James, a
baseball writer and statistician, challenged conventional
wisdom by taking an exploratory mind-set to baseball
statistics. He literally changed how baseball
management makes talent decisions, and even how
they manage on the field. In fact, James became senior
advisor for baseball operations in the Boston Red Sox’s
front office.
For example, James showed that batting average is
less an indicator of a player’s future success than
how often he’s involved in scoring runs—getting on
base, advancing runners, or driving them in. In this
example and many others, James used his knowledge
of the topic, explored the data, asked questions no
one had asked, and then formulated, tested, and
refined hypotheses.
Says Velez: “Our analytics team within Razorfish has
the James types of folks who can help drive different
thinking and envision possibilities with the data. We
need to find a lot more of those people. They’re not
very easy to find. There is an aspect of James that
just has to do with boldness and courage, a willingness
to challenge those who are in the habit of using
metrics they’ve been using for years.”
The CIO will need people throughout the organization
who have all sorts of relevant analysis and coding skills,
who understand the value of data, and who are not
afraid to explore. This does not mean the end of the
technology- or application-centric organizational chart
of the typical IT organization. Rather, it means the
addition of a data-exploration dimension that is more
than one or two people. These people will be using a
blend of tools that differ depending on requirements,
as Table 2 illustrates. More of the tools will be open
source than in the past.
Skills
Tools (a sampler)
Comments
Natural language processing
and text mining
Clojure, Redis, Scala, Crane, other
Java functional language libraries,
Python Natural Language ToolKit
To some extent, each of these serves as
a layer of abstraction on top of Hadoop.
Those familiar keep adding layers on top of
layers. FlightCaster, for example, uses a
stack consisting of Amazon S3 -> Amazon
EC2 -> Cloudera -> HDFS -> Hadoop ->
Cascading -> Clojure.1
Data mining
R, Mathlab
R is more suited to finance and statistics,
whereas Mathlab is more engineering
oriented.2
Scripting and NoSQL
database programming skills
Python and related frameworks,
HBase, Cassandra, CouchDB,
Tokyo Cabinet
These lend themselves to or are based on
functional languages such as LISP, or
comparable to LISP. CouchDB, for
example, is written in Erlang.3 (See the
discussion of Clojure and LISP on page 30.)
Table 2: New skills and tools for the IT department
Source: Cited online postings and PricewaterhouseCoopers, 2008–2010
Pete Skomoroch, “How FlightCaster Squeezes Predictions from Flight Data,” Data Wrangling blog, August 24, 2009,
http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data (accessed May 14, 2010).
1
Brendan O’Connor, “Comparison of data analysis packages,” AI and Social Science blog, February 23, 2009,
http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/
(accessed May 25, 2010).
2
Scripting languages such as Python run more slowly than Java, but developers sometimes make the tradeoff
to increase their own productivity. Some companies have created their own frameworks and released these
to open source. See Klaas Bosteels, “Python + Hadoop = Flying Circus Elephant,” Last.HQ Last.fm blog,
May 29, 2008, http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant (accessed May 14, 2010).
3
41
“Every technology department has a skunkworks, no matter how
informal—a sandbox where they can test and prove technologies.
That’s how open source entered our organization. A small Hadoop
installation might be a gateway that leads you to more open source.
But it might turn out to be a neat little open-source project that
sits by itself and doesn’t bother anything else.” —CIO of a small
Massachusetts company
Where do CIOs find such talent? Start with your own
enterprise. For example, business analysts managing
the marketing department’s lead-generation systems
could be promoted onto an IT data staff charged with
exploring the data flow. Most large consumer-oriented
companies already have people in their business units
who can analyze data and suggest solutions to
problems or identify opportunities. These people need
to be groomed and promoted, and more of them hired
for IT, to enable the entire organization, not just the
marketing department, to reap the riches.
Set up a sandbox
Although the business case CIOs can make for Big Data
is inarguable, even inarguable business cases carry
some risk. Many CIOs will look at the risks associated
with Big Data and find a familiar canard. Many Big Data
technologies—Hadoop in particular—are open source,
and open source is often criticized for carrying too
much risk.
The open-source versus proprietary technology
argument is nothing new. CIOs who have tried to
implement open-source programs, from the Apache
Web server to the Drupal content-management system,
have faced the usual arguments against code being
available to all comers. Some of those arguments,
especially concerns revolving around security and
reliability, verge on the specious. Google built its internal
Web servers atop Apache. And it would be difficult to
find a Big Data site as reliable as Google’s.
42
Clearly, one challenge CIOs face has nothing to do
with data or skill sets. Open-source projects become
available earlier in their evolution than do proprietary
alternatives. In this respect, Big Data tools are less
stable and complete than are Apache or Linux
open-source tool kits.
Introducing an open-source technology such as
Hadoop into a mostly proprietary environment does
not necessarily mean turning the organization upside
down. A CIO at a small Massachusetts company says,
“Every technology department has a skunkworks,
no matter how informal—a sandbox where they can
test and prove technologies. That’s how open source
entered our organization. A small Hadoop installation
might be a gateway that leads you to more open
source. But it might turn out to be a neat little opensource project that sits by itself and doesn’t bother
anything else. Either can be OK, depending on the
needs of your company.”
Bud Albers, executive vice president and CTO of
Disney Technology Shared Services Group, concurs.
“It depends on your organizational mind-set,” he says.
“It depends on your organizational capability. There
is a certain ‘don’t try this at home’ kind of warning
that goes with technologies like Hadoop. You have to
be willing at this stage of its maturity to maybe have
a little higher level of capability to go in.”
PricewaterhouseCoopers agrees with those sentiments
and strongly urges large enterprises to establish a
sandbox dedicated to Big Data and Hadoop/
MapReduce. This move should be standard operating
procedure for large companies in 2010, as should a
small, dedicated staff of data explorers and modest
budget for the efforts. For more information on what
should be in your sandbox, refer to the article, “Building
a bridge to the rest of your data,” on page 22.
And for some ideas on how the sandbox could fit in
your organization chart, see Figure 1.
VP of IT
Marketing
manager
Director of application development
Web site
manager
Director of data analysis
Sales
manager
Data
Data analysis
exploration
team
team
Operations
manager
Finance
manager
Figure 1: Where a data exploration team might fit in an
organization chart
Different companies will want to experiment with
Hadoop in different ways, or segregate it from the rest
of the IT infrastructure with stronger or weaker walls.
The CIO must determine how to encourage this kind
of experimentation.
Understand and manage the risks
Some of the risks associated with Big Data are
legitimate, and CIOs must address them. In the case of
Hadoop clusters, security is a pressing question: it was
a feature added as the project developed, not cooked
in from the beginning. It’s still far from perfect. Many
open-source projects start as projects intended to
prove a concept or solve a particular problem. Some,
such as Linux or Mozilla, become massive successes,
but they rarely start with the sort of requirements a CIO
faces when introducing systems to corporate settings.
Beyond open source, regardless of which tools are
used to manipulate data, there are always risks
associated with making decisions based on the analysis
of Big Data. To give one dramatic example, the recent
financial crisis was caused in part by banks and rating
agencies whose models for understanding value at
risk and the potential for securities based on subprime
mortgages to fail were flat-out wrong. Just as there is
risk in data that is not sufficiently clean, there is risk
in data manipulation techniques that have not been
sufficiently vetted. Many times, the only way to
understand big, complicated data is through the use
of big, complicated algorithms, which leaves a door
open to big, catastrophic mistakes in analysis.
Proactively preventing these mistakes from happening
requires the risk mind-set, says Larry Best, IT risk
manager at PricewaterhouseCoopers. “You have to
think carefully about what can go wrong, do a
quantitative analysis of the likelihood of such a mistake,
and anticipate the impact if the mistake occurs.”
43
Table 3 includes a list of the risks associated with Big Data analysis and ways to mitigate them.
Risk
Mitigation tactic
Over-reliance on insights gleaned from
data analysis leads to loss
Testing
Inaccurate or obsolete data
Maintain strong metadata management; unverified information must be flagged
Analysis leads to paralysis
Keep the sandbox related to the business problem or opportunity
Security
Keep the Hadoop clusters away from the firewall, be vigilant, ask chief security
officer for help
Buggy code and other glitches
Make sure the team keeps track of modifications and other implementation
history, since documentation isn’t plentiful
Rejection by other parts of
the organization
Perform change management to help improve the odds of acceptance,
along with quick successes
Table 3: How to mitigate the risks of Big Data analysis
Best points out that choosing the right controls to
implement in a given risk scenario is essential. The only
way to make sound choices is by adopting a risk
mind-set and approach, that allow a focus on the most
critical controls, he says. Enterprises simply don’t have
the resources to implement blanket controls. The
Control Objectives for Information and related
Technology (COBIT) framework, a popular reference for
IT risk assessment, is a “phone book of thousands of
controls.” he says. Risk is not juggling a lot of balls.
“Risk is knowing which balls are made out
of rubber and which are made out of glass.”
By nature and by work experiences, most CIOs are
risk averse. Blue-chip CIOs hold off installing new
versions of software until they have been proven
beyond a doubt, and these CIOs don’t standardize
on new platforms until the risk for change appears to
be less than the risk of stasis.
“The fundamental issue is whether IT is willing to
depart from the status quo, such as an RDBMS
[relational database management system], in favor
44
of more powerful technologies,” Dresner says.
“This means massive change, and IT doesn’t
always embrace change.” More forward-thinking
IT organizations constantly review their software
portfolio and adjust accordingly.
In this case, the need to manipulate larger and larger
amounts of data that companies are collecting is
pressing. Even risk-averse CIOs are exploring the
possibilities of Big Data for their businesses. Bud
Mathaisel, CIO of the outsourcing vendor Achievo,
divides the risks of Big Data and their solutions into
three areas:
• Accessibility—The data repository used for data
analysis should be access managed.
• Classification—Gray data should be identified
as such.
• Governance—Who’s doing what with this?
Yes, Big Data is new. But accessibility, classification,
and governance are matters CIOs have had to deal
with for many years in many guises.
Conclusion
At many companies, Big Data is both an opportunity
(what useful needles can we find in a terabyte-sized
haystack?) and a source of stress (Big Data is
overwhelming our current tools and methods; they
don’t scale up to meet the challenge). The prefix
“tera” in “terabyte,” after all, comes from the Greek
word for “monster.” CIOs aiming to use Big Data to
add value to their businesses are monster slayers.
CIOs don’t just manage hardware and software now;
they’re expected to manage the data stored in that
hardware and used by that software—and provide a
framework for delivering insights from the data.
From Amazon.com to the Boston Red Sox, diverse
companies compete based on what data they collect
and what they learn from it. CIOs must deliver easy,
reliable, secure access to that data and develop
consistent, trustworthy ways to explore and wrench
wisdom from that data. CIOs do not need to rush, but
they do need to be prepared for the changes that Big
Data is likely to require.
Perhaps the most productive way for CIOs to frame the
issue is to acknowledge that Big Data isn’t merely a
new model; it’s a new way to think about all data
models. Big Data isn’t merely more data; it is different
data that requires different tools. As more and more
internal and external sources cast off more and more
data, basic notions about the size and attributes of data
sets are likely to change. With those changes, CIOs will
be expected to capture more data and deliver it to the
executive team in a manner that reveals the business—
and how to grow it—in new ways.
Web companies have set the bar high already. John
Avery, a partner at Sungard Consulting Services, points
to the YouTube example: “YouTube’s ability to index a
data store of such immense size and then accrete
additional analysis on top of that, as an ongoing
process with no foresight into what those analyses
would look like when the data was originally stored,
is very, very impressive. That is something that has
challenged folks in financial technology for years.”
As companies with a history of cautious data policies
begin to test and embrace Hadoop, MapReduce, and
the like, forward-looking CIOs will turn to the issues that
will become more important as Big Data becomes the
norm. The communities arising around Hadoop (and the
inevitable open-source and proprietary competitors that
follow) will grow and become influential, inspiring more
CIOs to become more data-centric. The profusion of
new data sources will lead to dramatic growth in the
use and diversity of metadata. As the data grows, so
will our vocabulary for understanding it.
Whether learning from Google’s approach to Big Data,
hiring a staff primed to maximize its value, or managing
the new risks, forward-looking CIOs will, as always,
be looking to enable new business opportunities
through technology.
As companies with a history of
cautious data policies begin to
test and embrace Hadoop,
MapReduce, and the like, forwardlooking CIOs will turn to the issues
that will become more important
as Big Data becomes the norm.
The communities arising around
Hadoop (and the inevitable opensource and proprietary competitors
that follow) will grow and become
influential, inspiring more CIOs to
become more data-centric.
45
Razorfish’s Mark Taylor and Ray Velez
discuss how new techniques enable
them to better analyze petabytes of
Web data.
Interview conducted by Alan Morrison and Bo Parker
Mark Taylor is global solutions director and Ray Velez is CTO of
Razorfish, an interactive marketing and technology consulting
firm that is now a part of Publicis Groupe. In this interview, Taylor and Velez discuss how they use Amazon’s Elastic
Compute Cloud (EC2) and Elastic MapReduce services, as well as Microsoft Azure Table services, for large-scale
customer segmentation and other data mining functions.
PwC: What business problem were you trying to
solve with the Amazon services?
MT: We needed to join together large volumes of
disparate data sets that both we and a particular client
can access. Historically, those data sets have not been
able to be joined at the capacity level that we were able
to achieve using the cloud.
In our traditional data environment, we were limited
to the scope of real clickstream data that we could
actually access for processing and leveraging
bandwidth, because we procured a fixed size of data.
We managed and worked with a third party to serve
that data center.
This approach worked very well until we wanted to tie
together and use SQL servers with online analytical
processing cubes, all in a fixed infrastructure. With the
cloud, we were able to throw billions of rows of data
together to really start categorizing that information
so that we could segment non-personally identifiable
data from browsing sessions and from specific ways
in which we think about segmenting the behavior
of customers.
46
That capability gives us a much smarter way to apply
rules to our clients’ merchandising approaches, so that
we can achieve far more contextual focus for the use of
the data. Rather than using the data for reporting only,
we can actually leverage it for targeting and think about
how we can add value to the insight.
RV: It was slightly different from a traditional database
approach. The traditional approach just isn’t going to
work when dealing with the amounts of data that a tool
like the Atlas ad server [a Razorfish ad engine that is
now owned by Microsoft and offered through Microsoft
Advertising] has to deal with.
PwC: The scalability aspect of it seems clear.
But is the nature of the data you’re collecting
such that it may not be served well by a
relational approach?
RV: It’s not the nature of the data itself, but what we
end up needing to deal with when it comes to relational
data. Relational data has lots of flexibility because of
the normalized format, and then you can slice and dice
and look at the data in lots of different ways. Until you
“Rather than using the data for reporting only, we can actually leverage it
for targeting and think about how we can add value to the insight.”
—Mark Taylor
put it into a data warehouse format or a denormalized
EMR [Elastic MapReduce] or Bigtable type of format,
you really don’t get the performance that you need
when dealing with larger data sets.
So it’s really that classic tradeoff; the data doesn’t
necessarily lend itself perfectly to either approach.
When you’re looking at performance and the amount
of data, even a data warehouse can’t deal with the
amount of data that we would get from a lot of our
data sources.
PwC: What motivated you to look at this new
technology to solve that old problem?
RV: Here’s a similar example where we used a slightly
different technology. We were working with a large
financial services institution, and we were dealing with
massive amounts of spending-pattern and anonymous
data. We knew we had to scale to Internet volumes,
and we were talking about columnar databases. We
wondered, can we use a relational structure with
enough indexes to make it perform well? We
experimented with a relational structure and it
just didn’t work.
So early on we jumped into what Microsoft Azure
technology allowed us to do, and we put it into a
Bigtable format, or a Hadoop-style format, using Azure
Table services. The real custom element was designing
the partitioning structure of this data to denormalize
what would usually be five or six tables into one huge
table with lots of columns, to the point where we started
to bump up against the maximum number of columns
they had.
We were able to build something that we never would
have thought of exposing to the world, because it never
would have performed well. It actually spurred a whole
new business idea for us. We were able to take what
would typically be a BusinessObjects or a Cognos
application, which would not scale to Internet volumes.
We did some sizing to determine how big the data
footprint would be. Obviously, when you do that, you
tend to have a ton more space than you require,
because you’re duplicating lots and lots of data that,
with a relational database table, would be lookup data
or other things like that. But it turned out that when I
laid the indexes on top of the traditionally relational
data, the resulting data set actually had even greater
storage requirements than performing the duplication
and putting the data set into a large denormalized
format. That was a bit of a surprise to us. The size of
the indexes got so large.
When you think about it, maybe that’s just how an index
works anyway—it puts things into this denormalized
format. An index file is just some closed concept in
your database or memory space. The point is, we would
have never tried to expose that to consumers, but we
were able to expose it to consumers because of this
new format.
MT: The first commercial benefits were the ability to
aggregate large and disparate data into one place and
the extra processing power. But the next phase of
benefits really derives from the ability to identify true
relationships across that data.
47
“The stat section on [the MLB] site was always the most difficult part of
the site, but the business insisted it needed it.” —Ray Velez
Tiny percentages of these data sets have the most
significant impact on our customer interactions. We
are already developing new data measurement and KPI
[key performance indicator] strategies as we’re starting
to ask ourselves, “Do our clients really need all of the
data and measurement points to solve their
business goals?”
PwC: Given these new techniques, is the skill
set that’s most beneficial to have at Razorfish
changing?
RV: It’s about understanding how to map data into a
format that most people are not familiar with. Most
people understand SQL and relational format, so I think
the real skill set evolution doesn’t have quite as much
to do with whether the tool of choice is Java or Python
or other technologies; it’s more about do I understand
normalized versus denormalized structures.
MT: From a more commercial viewpoint, there’s a shift
away from product type and skill set, which is based
around constraints and managing known parameters,
and very much more toward what else can we do.
It changes the impact—not just in the technology
organization, but in the other disciplines as well.
I’ve already seen a profound effect on the old ways of
doing things. Rather than thinking of doing the same
things better, it really comes down to having the people
and skills to meet your intended business goals.
Using the Elastic MapReduce service with Cascading,
our solutions can have a ripple effect on all of the
non-technical business processes and engagements
across teams. For example, conventional marketing
segmentation used to involve teams of analysts who
waded through various data sets and stages of
processing and analysis to make sense of how a
business might view groups of customers. Using the
Hadoop-style alternative and Cascading, we’re able to
identify unconventional relationships across many data
points with less effort, and in the process create new
segmentations and insights.
48
This way, we stay relevant and respond more quickly to
customer demand. We’re identifying new variations and
shifts in the data on a real-time basis that would have
taken weeks or months, or that we might have missed
completely, using the old approach. The analyst’s role
in creating these new algorithms and designing new
methods of campaign planning is clearly key to this
type of solution design. The outcome of all this is really
interesting and I’m starting to see a subtle, organic
response to different changes in the way our solution
tracks and targets customer behavior.
PwC: Are you familiar with Bill James, a Major
League Baseball statistician who has taken a
rather different approach to metrics? James
developed some metrics that turned out to be
more useful than those used for many years
in baseball. That kind of person seems to be
the type that you’re enabling to hypothesize,
perhaps even do some machine learning to
generate hypotheses.
RV: Absolutely. Our analytics team within Razorfish has
the Bill James type of folks who can help drive different
thinking and envision possibilities with the data. We
need to find a lot more of those people. They’re not
very easy to find. And we have some of the leading
folks in the industry.
You know, a long, long time ago we designed the
Major League Baseball site and the platform. The stat
section on that site was always the most difficult part
of the site, but the business insisted it needed it. The
amount of people who really wanted to churn through
that data was small. We were using Oracle at the time.
We used the concept of temporary tables, which
would denormalize lots of different relational tables for
performance reasons, and that was a challenge. If I
had the cluster technology we do now back in 1999
and 2000, we could have built to scale much more
than going to two measly servers that we could cluster.
PwC: The Bill James analogy goes beyond
batting averages, which have been the age-old
metric for assessing the contribution of a hitter
to a team, to measuring other things that weren’t
measured before.
RV: Even crazy stuff. We used to do things like, show
me all of Derek Jeter’s hits at night on a grass field.
PwC: There you go. Exactly.
RV: That’s the example I always use, because that was
the hardest thing to get to scale, but if you go to the
stat section, you can do a lot of those things. But if too
many people went to the stat section on the site, the
site would melt down, because Oracle couldn’t handle
it. If I were to rebuild that today, I could use an EMR
or a Bigtable and I’d be much happier.
PwC: Considering the size of the Bigtable that
you’re able to put together without using joins, it
seems like you’re initially able to filter better and
maybe do multistage filtering to get to something
useful. You can take a cyclical approach to your
analysis, correct?
RV: Yes, you’re almost peeling away the layer of the
onion. But putting data into a denormalized format does
restrict flexibility, because you have so much more
power with a where clause than you do with a standard
EMR or Bigtable access mechanism.
It’s like the difference between something built for
exactly one task versus something built to handle tasks
I haven’t even thought of. If you peel away the layer of
the onion, you might decide, wow, this data’s interesting
and we’re going in a very interesting direction, so what
about this? You may not be able to slice it that way. You
might have to step back and come up with a different
partition structure to support it.
PwC: Social media is helping customers become
more active and engaged. From a marketing
analysis perspective, it’s a variation on a Super
Bowl advertisement, just scaled down to that
social media environment. And if that’s going to
happen frequently, you need to know what is the
impact, who’s watching it, and how are the
people who were watching it affected by it. If
you just think about the data ramifications of
that, it sort of blows your mind.
RV: If you think about the popularity of Hadoop and
Bigtable, which is really looking under the covers of
the way Google does its search, and when you think
about search at the end of the day, search really is
recommendations. It’s relevancy. What are the impacts
on the ability of people to create new ways to do search
and to compete in a more targeted fashion with the
search engine? If you look three to five years out, that’s
really exciting. We used to say we could never re-create
that infrastructure that Google has; Google is the
second largest server manufacturer in the world. But
now we have a way to create small targeted ways of
doing what Google does. I think that’s pretty exciting. n
“What are the impacts on the ability
of people to create new ways to
do search and to compete in a
more targeted fashion with the
search engine? If you look three to
five years out, that’s really exciting.”
—Ray Velez
49
Acknowledgments
Advisory
Sponsor & Technology Leader
Tom DeGarmo
US Thought Leadership
Partner-in-Charge
Tom Craren
Center for Technology and Innovation
Managing Editor
Bo Parker
Editors
Vinod Baya, Alan Morrison
Contributors
Larry Best, Galen Gruman, Jimmy Guterman, Larry Marion, Bill Roberts
Editorial Advisers
Markus Anderle, Stephen Bay, Brian Butte, Tom Johnson,
Krishna Kumaraswamy, Bud Mathaisel, Sean McClowry, Rajesh Munavalli,
Luis Orama, Dave Patton, Jonathan Reichental, Terry Retter, Deepak Sahi,
Carter Shock, David Steier, Joe Tagliaferro, Dimpsy Teckchandani,
Cindi Thompson, Tom Urquhart, Christine Wendin, Dean Wotkiewich
Copyedit
Lea Anne Bantsari, Ellen Dunn
Transcription
Paula Burns
50
Graphic Design
Industry perspectives
Art Director
Jacqueline Corliss
During the preparation of this publication, we benefited
greatly from interviews and conversations with the
following executives and industry analysts:
Designers
Jacqueline Corliss, Suzanne Lau
Illustrators
Donald Bernhardt, Suzanne Lau,
Tatiana Pechenik
Photographers
Tim Szumowski, David Tipling
(Getty Images), Marina Waltz
Bud Albers, executive vice president and chief
technology officer, Technology Shared Services
Group, Disney
Matt Aslett, analyst, enterprise software, the451
John Avery, partner, Sungard Consulting Services
Amr Awadallah, vice president, engineering,
and chief technology officer, Cloudera
Online
Phil Buckle, chief technology officer, National Policing
Improvement Agency
Director, Online Marketing
Jack Teuber
Howard Dresner, president and founder, Dresner
Advisory Services
Designer and Producer
Scott Schmidt
Brian Donnelly, founder and chief executive officer,
InSilico Discovery
Reviewers
Dave Stuckey, Chris Wensel
Marketing
Bob Kramer
Special thanks to
Ray George, Page One
Rachel Lovinger, Razorfish
Mariam Sughayer, Disney
Matt Estes, principal data architect, Technology Shared
Services Group, Disney
Jim Kobelius, senior analyst, Forrester Research
Doug Lenat, founder and chief executive officer, Cycorp
Roger Magoulas, research director, O’Reilly Media
Nathan Marz, lead engineer, BackType
Bill McColl, founder and chief executive officer,
Cloudscale
John Parkinson, acting chief technology officer,
TransUnion
David Smoley, chief information officer, Flextronics
Mark Taylor, global solutions director, Razorfish
Scott Thompson, vice president, architecture,
Technology Shared Services Group, Disney
Ray Velez, chief technology officer, Razorfish
Acknowledgments
51
pwc.com/us
To have a deeper conversation
about how this subject may affect
your business, please contact:
Tom DeGarmo
Principal, Technology Leader
PricewaterhouseCoopers
+1 267-330-2658
[email protected]
This publication is printed on Coronado Stipple Cover made from 30% recycled fiber; and
Endeavor Velvet Book made from 50% recycled fiber, a Forest Stewardship Council (FSC)
certified stock using 25% post-consumer waste.
Recycled paper
Subtext
Big Data
Data sets that range from many terabytes to petabytes in size, and
that usually consist of less-structured information such as Web log files.
Hadoop cluster
A type of scalable computer cluster inspired by the Google
Cluster Architecture and intended for cost-effectively processing
less-structured information.
Apache Hadoop
The core of an open-source ecosystem that makes Big Data analysis
more feasible through the efficient use of commodity computer clusters.
Cascading
A bridge from Hadoop to common Java-based programming techniques
not previously usable in cluster-computing environments.
NoSQL
A class of non-relational data stores and data analysis techniques that
are intended for various kinds of less-structured data. Many of these
techniques are part of the Hadoop ecosystem.
Gray data
Data from multiple sources that isn’t formatted or vetted for
specific needs, but worth exploring with the help of Hadoop
cluster analysis techniques.
Comments or requests? Please visit www.pwc.com/techforecast OR send e-mail to: [email protected]
PricewaterhouseCoopers (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for
its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to
develop fresh perspectives and practical advice.
© 2010 PricewaterhouseCoopers LLP. All rights reserved. “PricewaterhouseCoopers” refers to PricewaterhouseCoopers LLP, a Delaware limited
liability partnership, or, as the context requires, the PricewaterhouseCoopers global network or other member firms of the network, each of which is a
separate and independent legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation
with professional advisors.

Tapping into the power of Big Data

Transcripción

Documentos relacionados

WDWOPS-11-21375 DTD JAN GM RV.indd

Innovation`s new world order The 2015 Global - Strategy

Tax policy and administration: Global trends/June 2012 In the

So be goodfor goodness sake!

Digital Transformation - Engage customers through social

management positioning in online markets

What Would Google Do?

PwC Paying Taxes

Nera report - Coalición Pro Internet