Successfully utilizing Big Data has become critical for many organizations, and that includes formulating a platform strategy, be it a “Data Hub,” “Data Platform” or “Data Lake.”
By Guruswamy Ganesh
As Big Data and analytics gain traction across the business landscape, enterprises are looking for ways to make data work harder for them. IDC forecasts that the Indian Big Data analytics and business intelligence market will grow at a 26.4 percent compound annual growth rate to $41.5 billion through 2018. In 2015, the development of information, mobile, social media and cloud technologies altered the market significantly and brought about a shift toward self-service, cloud and analytics applications tailored for business users and information workers. However, the fact remains that Big Data is big and complex, not only in the accumulation of information but also in its impact on business strategy. Successfully utilizing Big Data has become critical for many organizations, and that includes formulating a platform strategy, be it a “Data Hub,” “Data Platform” or “Data Lake.”
The truth is that Big Data is hard to do. According to Gartner, through 2018, 70 percent of Hadoop deployments are predicted to fail to meet cost savings and revenue generation objectives due to skills and integration challenges. So how do you do Big Data ‘right’? Here are the most common Big Data pitfalls you should avoid:
Dismissing the need for Enterprise Platform or Data-Centric Architecture
Enterprises must begin with an enterprise platform strategy and a data-centric architecture to break down the debilitating silos rampant in organizations of all sizes. Big Data requires the ability to process in parallel, with as little friction as possible, in a completely scalable and distributed environment. Unlike in traditional database systems or isolated application islands, data in a data-centric architecture or enterprise platform is not restrained, schema-bound or locked.
Absence of a vision for a “Data Lake”
“Data Lakes” are massive, easily accessible, centralized repositories of large volumes of structured and unstructured data. They can hold internal, external and partner data alike. The Data Lake repository provides powerful benefits through the “economics of Big Data,” with costs to store and analyze data up to 30x to 50x lower than in traditional setups. The Data Lake can capture data “as-is,” in raw form, without any transformation or schema creation beforehand, using automated rapid ingest mechanisms. The Data Lake plays a pivotal role in the journey towards connecting enterprise data with seamless data access, iterative algorithm development and agile deployment.
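Capturing raw data first and applying a schema only when it is read can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not how any particular Data Lake product works: the in-memory list standing in for storage, the function names `ingest_raw` and `read_with_schema`, and the sample records are all hypothetical.

```python
import json


def ingest_raw(store, line):
    """Append the record exactly as received, with no parsing or validation."""
    store.append(line)


def read_with_schema(store, schema):
    """Apply a schema at read time.

    `schema` maps field names to cast functions. Lines that are not JSON
    objects are skipped for this view but stay untouched in the store.
    """
    for line in store:
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(raw, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = raw.get(field)
            try:
                row[field] = cast(value) if value is not None else None
            except (TypeError, ValueError):
                row[field] = None
        yield row


# Hypothetical raw feed: mixed shapes, one line that is not JSON at all.
raw_store = []
ingest_raw(raw_store, '{"user": "a1", "amount": "19.90", "channel": "web"}')
ingest_raw(raw_store, '{"user": "b2", "amount": 5, "device": {"os": "android"}}')
ingest_raw(raw_store, "free-text log line that is not JSON")

# Two applications read the same raw data through different schemas.
sales_view = list(read_with_schema(raw_store, {"user": str, "amount": float}))
channel_view = list(read_with_schema(raw_store, {"user": str, "channel": str}))
```

Nothing is discarded at ingest time, so a new application can later define its own view over the same raw lines without a reload.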
Not forecasting Data Growth or Levels of Maturity
When the Data Lake becomes the default data destination, governance and fine-grained security become pivotal from the outset. Metadata access and storage, along with data lineage and annotations, become built in. Raw data and data at various stages of transformation can all live side by side without conflict. Applications can use each other’s data via Hadoop. External data can be shielded or integrated based on explicit processing and analytics requirements. With variable data sets all living on the Data Lake, data availability increases and application deployment time decreases, with room for unlimited scalability and growth.
Using small Data sets for analysis
Many assume that data doesn’t necessarily need to be united, and that one can work with small sample sets of data. This is a dangerous misconception: the results are often extrapolated to larger data sets without accounting for variance, which leads to misleading or, more likely, deeply skewed results. This is often called the curse of small-sample analysis. For example, when you work with a small sample data set, you might come across many outliers or anomalies. With a small sample, there is no way of knowing whether an anomaly is actually structural in the larger data set, or whether the outliers form a pattern with a definite signature.
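The effect is easy to reproduce in a few lines of Python. The sketch below is illustrative only: the population is synthetic, the 5 percent high-value segment is an assumption made for the example, and the function names are hypothetical. It draws repeated samples of two sizes and measures how much the estimated mean moves from one sample to the next.

```python
import random
import statistics


def sample_means(population, sample_size, trials, seed=0):
    """Estimate the population mean from `trials` independent random samples."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]


def spread_of_estimates(population, sample_size, trials=200, seed=0):
    """Standard deviation of the estimated means: how unstable the answer is."""
    return statistics.pstdev(sample_means(population, sample_size, trials, seed))


# Synthetic population (an assumption for illustration): 95% ordinary records
# plus a 5% segment with much higher values. The segment is structural, not noise.
rng = random.Random(42)
population = [rng.gauss(100, 10) for _ in range(9500)]
population += [rng.gauss(400, 20) for _ in range(500)]

small_spread = spread_of_estimates(population, sample_size=20)
large_spread = spread_of_estimates(population, sample_size=2000)

print(f"spread of mean estimate, n=20:   {small_spread:.2f}")
print(f"spread of mean estimate, n=2000: {large_spread:.2f}")
```

With 20 records, a sample typically contains zero, one or two members of the high-value segment, so they look like stray outliers and the estimate swings widely. With 2,000 records the segment shows up consistently and the estimate settles.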
Less Data on Sophisticated Algorithms defeats the purpose
Another misconception is that advanced and complex algorithms will solve all the problems. However, that’s not always the case. Computers, since they operate on logical processes, will unquestioningly process unintended, even nonsensical, input data and produce undesired, often nonsensical, output. In computer science, this is called “garbage in, garbage out,” and it applies whenever uncleansed data is fed to complex algorithms. Missing or sparse data, null values and human errors must all be cleansed. Avoid relying on unproven assumptions or weak correlations. Instead, collect as much data as possible and let the data speak for itself. This is very cost-effective with the implementation of a data platform.
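A basic cleansing pass can be sketched in Python as follows. This is a minimal illustration, not a complete data-quality pipeline: the record layout, the `cleanse` function and the valid range are assumptions chosen for the example.

```python
import math


def cleanse(records, required_fields=("id", "value"), valid_range=(-50.0, 60.0)):
    """Split records into (clean, rejected).

    A record is rejected if a required field is missing or null, if its
    "value" is not numeric, or if the number falls outside `valid_range`.
    Each rejected record is kept together with the reason, so nothing is
    silently dropped.
    """
    low, high = valid_range
    clean, rejected = [], []
    for rec in records:
        missing = [f for f in required_fields if rec.get(f) in (None, "")]
        if missing:
            rejected.append((rec, "missing: " + ", ".join(missing)))
            continue
        try:
            number = float(rec["value"])
        except (TypeError, ValueError):
            rejected.append((rec, "not numeric"))
            continue
        if math.isnan(number) or not low <= number <= high:
            rejected.append((rec, "out of range"))
            continue
        clean.append({**rec, "value": number})
    return clean, rejected


# Hypothetical sensor readings with a null, a typo and an implausible value.
readings = [
    {"id": 1, "value": "21.5"},
    {"id": 2, "value": None},
    {"id": 3, "value": "abc"},
    {"id": 4, "value": "9999"},
    {"id": 5, "value": 22.0},
]

clean, rejected = cleanse(readings)
for rec, reason in rejected:
    print(f"rejected id={rec['id']}: {reason}")
```

Keeping the rejected records with their reasons, rather than deleting them, makes it possible to check later whether the "errors" are themselves a pattern worth investigating.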
Many who haven’t yet implemented a Big Data project are assessing their data strategy for 2016, while others are looking at current undertakings and examining new ways of leveraging analytics to improve business operations and increase revenue streams. Avoiding the above-mentioned pitfalls is one of the best ways to start off the year on the right Big Data note.
The author is Vice President, Corporate Engineering, SanDisk India