Big data – from buzzword to strategy
Andreas Dietze is Partner at our InfoCom Competence Center
Mainframe computers, desktop clients, smartphones, self-service machines and embedded systems in vehicles or aircraft – all of these systems generate enormous volumes of data that contain valuable information on business processes, products and customers. Although low-cost infrastructures enable the processing of such data volumes, companies are not yet making systematic use of the information to establish a competitive edge.
One of the main reasons for this is that the principles of data processing have not really adapted to changes in circumstances. The relational databases used by companies today are based on a model that is absolutely reliable for read and write access. But it is not designed for large volumes of data on the terabyte or even petabyte scale. Systems like these can only be scaled up by upgrading the components in the database server, which limits the size they can attain.
Big data takes a different path
Big data breaks with this tradition and establishes a new data processing principle, which works on the assumption that the existing basis of data is only read, not edited. The processing is distributed in such a way as to enable the infrastructure to be scaled out flexibly to suit the scale of the problem. Google developed this approach, which it named MapReduce, as the core of its production infrastructure. The approach was later implemented in the popular open-source project Hadoop – the standard for big data technology today.
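The principle described above can be illustrated with a minimal sketch in plain Python. It is not Hadoop code: the record layout, the function names and the simulated "workers" are assumptions made purely for illustration. The input is only read, the work is split into independent chunks, and the partial results are combined per key.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical call detail records: (customer_id, call_minutes).
RECORDS = [
    ("alice", 12), ("bob", 3), ("alice", 7), ("carol", 20), ("bob", 5),
]

def map_phase(chunk):
    """Emit one (key, value) pair per record; the input is never modified."""
    return [(customer, minutes) for customer, minutes in chunk]

def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)

def run_job(records, workers=2):
    # Split the read-only input into one chunk per simulated worker.
    chunks = [records[i::workers] for i in range(workers)]
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

totals = run_job(RECORDS)
print(totals)
```

Because each chunk is processed independently and nothing is written back to the source data, adding workers changes only how the input is split, not the result. That is the property that lets such a system scale out rather than up.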
The advantages of this technology are clear: it provides the possibility to process large volumes of data (petabytes) and the flexibility to choose the basic infrastructure. Everything from simple, low-cost commodity hardware to a cloud-based infrastructure – there are no limits here. Amazon Web Services, for instance, offers preconfigured Hadoop systems, and Microsoft Azure will soon be doing so as well.
Curt Cramer is Project Manager and co-author of this column
China Mobile, the biggest of China's cell phone providers, has developed a Hadoop-based solution that analyzes call detail records to assess the usage behavior and churn probability of its customers. These analyses support the company's marketing and help improve network and service quality. The scale-up solutions it previously employed could analyze just 10% of customer data. The Hadoop solution achieves a dual goal: it lets the company analyze all of its call detail records and cuts costs at the same time. Because it runs on commodity hardware, the new solution cost just one-fifth of what the company used to pay and performs much better as well.
Relative lack of case studies hinders big data marketing
Despite all of their technological benefits, big data systems have not yet caught on in the market. IT market research firm Gartner estimates that just 20% of the big data initiatives around today are currently in the process of being implemented. And by 2015, no more than about 15% of businesses will have made the switch to big data processing.
One of the biggest obstacles to the success of big data on the market is the lack of case studies from a range of industries – most people cannot really grasp what big data is all about. There are not enough specific examples of its application that can provide actual evidence of the value that this technology adds. Nevertheless, some companies and institutions have already announced plans to concentrate more on big data:
- The New York Presbyterian Hospital achieved a 25% reduction in the number of cases of fatal thrombosis by implementing a systematic analysis of patient histories (source: Hortonworks).
- The Los Angeles Police Department employed predictive policing in a pilot project. By applying this solution, the police were able to zero in on crime hotspots and peak times for crime in advance (source: Cloudera).
- Trucking company US Xpress is saving several million dollars a year by analyzing sensor and geodata from its fleet of trucks. Shorter idle time and reduced fuel consumption are what help the business make these savings (source: Informatica).
- Financial service provider JP Morgan Chase has been using Hadoop for around three years for fraud detection and IT risk management (source: JP Morgan Chase).
- Retailer Sears is able to analyze the price elasticity of its products on a weekly basis thanks to Hadoop. The system looks at aspects like product availability and competitors' prices. Previously, the company was only able to use around 10% of the data in its records for such analyses; the calculations used to take around eight weeks (source: Wall Street Journal).
The current market for appropriate IT solutions represents another obstacle to the success of big data. Numerous providers currently offer Hadoop-based solutions, including companies like Cloudera, Hortonworks, Datameer and HStreaming, as well as big names such as IBM and EMC.
But all of them come up against a significant barrier: not one of these firms has any standardized industry solutions that can quickly be adapted to customers' needs. In many cases the solutions first need to be developed in joint projects with the customers, given that the IT firms have specialized in adapting the basic technologies around Hadoop.
IT departments and other divisions are not yet geared up for big data
A company's IT experts need a different set of skills to those required for systems that support today's standard of data processing if they plan to implement a big data system. Three aspects are especially important here: data analysis, data visualization and technical skills.
The big data model is inherently different from the widely established relational data model
Data processing normally involves an analysis in the form of a standard query in mature and user-friendly BI programs, which is followed by a standard report. For a big data application, the analyst first needs to define the data sources and then prepare them for automated processing. The analyst therefore needs to specify in advance the data cleansing rules, data formats and main parameters across all of the many data sources. This explorative approach differs from the standardized method in common usage today.
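As a rough illustration of what specifying cleansing rules and data formats in advance can look like, the sketch below declares them per data source and applies them automatically. The source names, field names and date formats are hypothetical and serve only as an example.

```python
from datetime import datetime

# Hypothetical source specifications: each field maps either to a Python
# type or to a date format string. Declaring them up front lets the
# processing run unattended across all sources.
SOURCES = {
    "crm_export": {
        "fields": {"customer_id": str, "signup": "%Y-%m-%d", "revenue": float},
        "required": ["customer_id"],
    },
    "web_log": {
        "fields": {"customer_id": str, "signup": "%d.%m.%Y", "revenue": float},
        "required": ["customer_id"],
    },
}

def cleanse(source, row):
    """Return a normalized record, or None if the row breaks a rule."""
    spec = SOURCES[source]
    clean = {}
    for name, fmt in spec["fields"].items():
        raw = (row.get(name) or "").strip()
        if not raw:
            if name in spec["required"]:
                return None          # rule: drop rows missing a required field
            clean[name] = None       # rule: optional fields may stay empty
            continue
        try:
            if isinstance(fmt, str):
                # Date fields are normalized to ISO format (YYYY-MM-DD).
                clean[name] = datetime.strptime(raw, fmt).date().isoformat()
            else:
                clean[name] = fmt(raw)
        except ValueError:
            return None              # rule: drop rows with malformed values
    return clean

print(cleanse("web_log", {"customer_id": "42", "signup": "01.03.2012"}))
```

Keeping the rules in data rather than in code means a new source can be added by extending the specification, which suits the exploratory way of working described above.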
Visualization plays a pivotal role here: in current business practice, reports help communicate analysis results to decision makers in a standardized form. But analysts developing big data applications do not access standardized reports – they use visualization instead to help them quickly pick out statistical patterns and trends. Only in the next step can they present the customer with standard reports containing facts drawn from a range of different data sources.
For a Hadoop-based analysis, the experts need sound knowledge of the framework itself and of the technologies around it (HDFS, HBase, Hive, Mahout). These skills are required not only for the analysis itself; they are also needed beforehand, to weigh up the big data methods. Companies do not currently have the internal resources to do this, because the necessary technologies were not developed by the leading database producers and in-house staff are therefore not familiar with them. So CIOs are called upon to stimulate innovation in their IT departments and in other divisions, too.
Step by step to a big data strategy
No matter what the current availability of out-of-the-box solutions is, companies need to develop a strategy to make appropriate use of the data they have – and they need to do this at an early stage. A data due diligence can help answer the key strategic questions.
A checklist:
- What challenges do we want to resolve by making use of the data?
- Why do we want to resolve these challenges? What is the business case?
- What data does the company need in order to do this?
- What data is currently available in which systems? Is it available in sufficient detail?
- Which of the data that we need is not yet systematically recorded?
- Can the missing data be generated as a byproduct of existing processes? Or do we need to develop new ways of recording it?
- Data infrastructure/architecture: Companies must determine what the leading systems for each of their data sets will be in the future, if this is not yet defined.
- Software infrastructure: Companies must determine which methods will be used for the data analyses. Normally these will be established BI tools capable of generating standard reports from the available data. In a big data approach, this software infrastructure consists of a big data platform such as Hadoop, connectors to the relevant data sources in the data architecture, and analysis tools like Hive for data warehousing, Mahout for machine learning and Pig for data-flow scripting with an interactive shell.
- Technical infrastructure: This covers the hardware and hosting needed to run the big data system. For companies, this is a classic make-or-buy decision. If analyses are to be carried out once only or there are major fluctuations in the volume of data or the demand for analyses, it might make more sense for them to use cloud-based infrastructures than to invest in their own hardware. The business case developed as part of the data due diligence can shed light on that aspect.
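The make-or-buy reasoning above can be sketched as a simple cost comparison. All figures, parameter names and the cost model itself are hypothetical assumptions for illustration; a real business case would include further factors such as staffing, data transfer and depreciation.

```python
def own_cost(capex, monthly_opex, months):
    """Total cost of owned hardware: purchase price plus running costs."""
    return capex + monthly_opex * months

def cloud_cost(node_hour_price, nodes, hours_per_month, months):
    """Total cost of rented capacity: only the hours actually used are paid."""
    return node_hour_price * nodes * hours_per_month * months

def cheaper_option(capex, monthly_opex, node_hour_price, nodes,
                   hours_per_month, months):
    """Return 'buy' or 'cloud', whichever is cheaper over the period."""
    own = own_cost(capex, monthly_opex, months)
    cloud = cloud_cost(node_hour_price, nodes, hours_per_month, months)
    return "buy" if own < cloud else "cloud"

# Hypothetical figures: a 20-node cluster over 36 months.
occasional = cheaper_option(capex=100_000, monthly_opex=2_000,
                            node_hour_price=0.5, nodes=20,
                            hours_per_month=40, months=36)
continuous = cheaper_option(capex=100_000, monthly_opex=2_000,
                            node_hour_price=0.5, nodes=20,
                            hours_per_month=720, months=36)
print(occasional, continuous)
```

With these assumed numbers, occasional use of 40 hours a month favors the cloud, while round-the-clock use favors owned hardware. This mirrors the point that one-off or strongly fluctuating workloads tend to suit cloud-based infrastructure.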

