One of the real advantages of a system like Hadoop is that it runs on commodity hardware, which keeps your hardware costs low. But when that hardware fails at an unusually high rate, it can really throw a wrench into your plans. This was the case recently when we set up a new cluster to collect our custom log data and experienced a high rate of hard drive failures. Here is what happened in about one week’s time:
- We set up a new 27-node cluster, installed Hadoop 2.0, got the system up and running, and started loading log files.
- By Friday (two days later), the cluster was down to 20 functioning nodes as data nodes began to fall out due to hard drive failures. The primary name node had failed over to the secondary name node.
- By Monday, the cluster was down to 12 nodes and the name node had failed over.
- On Wednesday the cluster was at 6 nodes and we had to shut it down.
- The failures coincided with the increased data load on the system. As soon as we started ingesting our log data, putting pressure on the hard drives, the failures started.
It makes you wonder what happened during the manufacturing process. Did a forklift drop a pallet of hard drives, and were those the drives installed in the machines sent to us? Did the vendor simply skip the quality control steps for this batch of hard drives? Did someone on the assembly line sneeze on the drives? Did sunspots cause this? Over 20% of the hard drives in this cluster had to be replaced in the first three weeks that this system was running. For a while, three or more nodes were failing every day. We started running scripts that looked at the S.M.A.R.T. monitoring information for the hard drives. Any drive that reported a failure or a predicted failure was identified and replaced. We had to do this proactively on every node in the cluster.
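The scripts we ran are not reproduced here, but a check along these lines can be sketched in Python. This is a minimal sketch, assuming smartmontools is installed and that `smartctl -H` and `smartctl -A` print their usual ATA-style output; the function names and the list of watched attributes are illustrative choices, not what we actually deployed.

```python
import subprocess

# Attributes often watched as early signs of disk trouble. This list is an
# assumption made for the sketch, not the set our own scripts used.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def drive_needs_replacement(health_text, attr_text):
    """Return a list of reasons to replace the drive (empty if it looks healthy)."""
    reasons = []
    for line in health_text.splitlines():
        # ATA drives report "... test result: PASSED" when healthy.
        if "overall-health self-assessment test result" in line:
            result = line.split(":", 1)[1].strip()
            if result != "PASSED":
                reasons.append("health check: " + result)
    for line in attr_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, when_failed, raw = fields[1], fields[8], fields[9]
        if when_failed != "-":
            reasons.append(name + " failed: " + when_failed)
        elif name in WATCHED and raw.isdigit() and int(raw) > 0:
            reasons.append(name + " raw value " + raw)
    return reasons


def check_device(device):
    """Run smartctl against one device (needs root and smartmontools)."""
    health = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True).stdout
    attrs = subprocess.run(["smartctl", "-A", device],
                           capture_output=True, text=True).stdout
    return drive_needs_replacement(health, attrs)
```

Calling `check_device("/dev/sda")` on each data node and collecting the non-empty results would give a replacement list. Keeping the parsing separate from the `smartctl` call means the logic can be tested against saved output without touching real disks.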
One interesting side note about Hadoop: our system never lost data. The HDFS file system check showed that replication had failed, but we still had at least one copy of every data block. As we rebuilt the cluster, the data was replicated back up to three copies.
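For readers who want to automate that check, here is a rough Python sketch that pulls two counters out of the text printed by `hdfs fsck /`. The labels `Under-replicated blocks` and `Corrupt blocks` are assumed to appear in the summary as written; the exact wording can differ between Hadoop versions, so treat the labels and the helper names as assumptions to verify against your own output.

```python
import re


def fsck_count(report, label):
    """Return the integer printed after 'label:' in an fsck report, or None."""
    pattern = r"^\s*" + re.escape(label) + r":\s+(\d+)"
    match = re.search(pattern, report, re.MULTILINE)
    return int(match.group(1)) if match else None


def replication_status(report):
    """Classify a report as 'healthy', 'degraded' (no loss), 'data loss', or 'unknown'."""
    under = fsck_count(report, "Under-replicated blocks")
    corrupt = fsck_count(report, "Corrupt blocks")
    if under is None or corrupt is None:
        return "unknown"  # labels not found; the output format may differ
    if corrupt > 0:
        return "data loss"
    return "degraded" if under > 0 else "healthy"
```

In the situation described above, a check like this would be expected to report `degraded`: blocks were under-replicated, but every block still had a surviving copy.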
What are we doing about this? First, we are having the vendor who stages our hardware run a set of diagnostics before sending the hardware to us. It is no longer “good enough” to make sure the systems power on. If problems are found, the vendor will swap out the hardware before we receive it. Second, we’ve set failure-rate standards for our hardware and we keep track of failures. If we see too many failures, we work proactively with the vendor on replacement hardware.
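Tracking failures against a standard does not need much tooling. The following Python sketch shows one way to do it; the `BatchRecord` type, the batch names, and the 5% threshold are all made up for illustration and are not our actual records or our actual standard.

```python
from dataclasses import dataclass


@dataclass
class BatchRecord:
    vendor_batch: str       # hypothetical identifier for a hardware shipment
    drives_installed: int
    drives_failed: int

    def failure_rate(self):
        """Fraction of installed drives that have failed so far."""
        return self.drives_failed / self.drives_installed


def batches_to_escalate(records, max_rate):
    """Return the batches whose failure rate exceeds the agreed maximum."""
    return [r.vendor_batch for r in records
            if r.drives_installed > 0 and r.failure_rate() > max_rate]


# Illustrative numbers only.
records = [
    BatchRecord("batch-A", drives_installed=100, drives_failed=21),
    BatchRecord("batch-B", drives_installed=100, drives_failed=2),
]
print(batches_to_escalate(records, max_rate=0.05))
```

Any batch returned by `batches_to_escalate` is one to raise with the vendor before more drives from it go into production.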
One of my Hadoop engineers put it this way: “If you purchase commodity memory, it may be slower but it runs. If you purchase commodity CPUs, they also run. If you purchase the least expensive commodity hard drives, they will fail.” He’s absolutely right.
If all else fails, get different commodity disk drives (enterprise level).