A successful Big Data strategy, in which huge and diverse data sets drive improvements across an organization, is not possible without a way to analyze that data. Apache Hadoop has emerged as perhaps the most popular way to harness Big Data and turn it into something useful.

Several companies are reaping significant rewards from implementing Hadoop, an open source framework that stores data and runs applications on clusters of commodity hardware to analyze large data sets. Many leaders think Hadoop is a new and emerging technology, but the framework was spun out of the Nutch search engine project in the mid-2000s by Doug Cutting and Mike Cafarella, and Cutting carried the work forward at Yahoo. Hadoop is written in the Java programming language, which has powered web applications for more than twenty years. Version 3.0 was released in late 2017.

"Grid computing has been around for more than a decade. Java and grid computing, coupled together, have proven the maturity on which Hadoop is based," said Larry Steele, TDK Technologies Director of Data Solutions.

Big Players Help Hadoop to Help Themselves

Another way to gauge the maturity of Apache Hadoop is to look at who contributes to the project. Yahoo, Facebook, Twitter, LinkedIn, Intel, Microsoft, Cloudera and Hortonworks, to name a few, each have multiple developers contributing software that strengthens Hadoop with every release. Because these organizations collect data at massive scale, they had to solve their own data collection problems and build better solutions to their unique analytics problems. Through countless hours of development, commitment and usage, they have worked together to create a mature platform that any company can leverage.

If Hadoop is more mature than commonly perceived, what makes it different? To start with, it’s the design principles. A few basic design principles of Hadoop include:

  • Leveraging commodity server hardware
  • Moving the processing to the data
  • Utilizing local hard drives

Hadoop Design Principles

Let’s look a little deeper at these design principles. Hadoop implementations should begin with commodity hardware containing six-core processors, 96 gigabytes of memory and as many one- to four-terabyte local hard drives as will fit in each chassis. While this configuration reflects the design principle, it’s not absolute.

"For implementations where response time is critical, larger processors, more memory and solid state disks are required. Still, utilizing commodity hardware provides a much more efficient cost model and flexibility for scaling out large solutions and use cases," said TDK Chief Technology Officer Mark Henman.

The next Hadoop design principle is to move the processing to where the data is stored, rather than separating processing from storage. This allows for faster, more efficient processing: application queries do not have to reach across the network to a remote disk to complete their work, which improves performance and reduces complexity. It also leads into the final design principle, local hard drives.
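
To make the idea of moving processing to the data concrete, here is a minimal sketch of the classic MapReduce word count in Java. It is an illustration rather than a production job: the class names are arbitrary and it assumes a standard Hadoop client library on the classpath. When the job runs, the framework tries to schedule each map task on a node that already holds the HDFS block that task will read.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Each map task is scheduled, where possible, on a node that already holds
      // the HDFS block it reads, so the code travels to the data.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce tasks combine the per-block counts into cluster-wide totals.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar and submitted with the standard hadoop jar command, the job reads its input blocks in place across the cluster; only the much smaller intermediate counts travel over the network.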

Hadoop performs and scales better with local hard drives and can be configured to use Just a Bunch of Disks (JBOD) rather than a Redundant Array of Independent Disks (RAID). By default, Hadoop creates three copies of the data, replicating it across multiple servers and drives. Because of that replication, Hadoop can use large (three- to six-terabyte) spinning disk drives while anticipating disk failure, rerunning affected work without interruption.

"Hadoop handles the mirroring or replication of the data as it’s written across the cluster, thereby eliminating the need for RAID," Henman said.

Simply put, this is all about economics. Over the last twenty-plus years, companies have spent enormous amounts of money separating the processor and memory from the data storage layer through Storage Area Networks (SAN) or Network Attached Storage (NAS). These technologies provide many benefits, but they come at a high price. A simple and effective way to determine whether Hadoop is right for your company is to calculate the cost of storage and processing using these design principles and compare it to the cost of the legacy approach to managing data.
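
A back-of-envelope version of that comparison might look like the sketch below. Every figure in it is a placeholder to be replaced with your own hardware and SAN/NAS quotes, and it counts only storage, not the processing capacity that comes bundled with every Hadoop node.

    public class StorageCostSketch {
      public static void main(String[] args) {
        // Placeholder inputs -- substitute your own quotes.
        double costPerHadoopNode = 6000.0;   // commodity chassis, CPUs, memory, local drives
        double rawTbPerNode = 48.0;          // e.g. twelve 4 TB local drives
        int replicationFactor = 3;           // Hadoop's default three copies of every block

        double usableTbPerNode = rawTbPerNode / replicationFactor;
        double hadoopCostPerUsableTb = costPerHadoopNode / usableTbPerNode;

        double legacyCostPerUsableTb = 2000.0; // placeholder SAN/NAS cost per usable terabyte

        System.out.printf("Hadoop (storage only): $%.0f per usable TB%n", hadoopCostPerUsableTb);
        System.out.printf("Legacy SAN/NAS:        $%.0f per usable TB%n", legacyCostPerUsableTb);
      }
    }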
