About 450 years ago John Heywood wrote, “many hands make light work.” The same can be said of image and data processing. Distributed parallel computing (DPC) makes it possible for us to do the work described by Michael Murdock in his series on the image processing pipeline. If you haven’t already, take a moment to read his excellent posts.
At Ancestry.com we use a DPC system developed in-house that we call “iFarm.” We also use more recognizable DPC systems such as Hadoop for some things, but our primary image processing pipeline, described by Michael, runs on the iFarm.
The iFarm’s Client Controller allows us to monitor and control the servers and task agents in the “farm” of servers processing tasks. It also allows us to roll new task code to each of the client nodes when a change is made to the code.
In addition to the image processing pipeline, and as the need arises, the Enterprise Media Team (EMT) creates and runs a series of image and data correction modules on already published images and data. We call this series of modules the Media Validation Processor (MVP). Probably the most significant MVP module is our Deep Zoom pre-processing module.
About 18 months ago Ancestry.com introduced its Deep Zoom image viewing technology. It lets our users zoom in and out on hard-to-read historical records or images in a record collection, such as the 1940 Census, with little or no delay in loading the image. For best performance, this technology requires that the original image be specially processed into what we call “tiles.”
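To give a feel for what “tiles” means in practice, here is a minimal sketch of the arithmetic behind a tile pyramid. It assumes fixed 256-pixel square tiles and a pyramid in which each level halves the previous one until the whole image fits in a single tile. Those choices, and the names `tile_pyramid` and `total_tiles`, are illustrative assumptions, not a description of the actual Deep Zoom tile layout we use.

```python
import math


def tile_pyramid(width, height, tile_size=256):
    """List (level_width, level_height, columns, rows), full resolution first.

    Assumption: each level is half the size of the previous one, and the
    pyramid stops at the first level that fits inside a single tile.
    """
    levels = []
    w, h = width, height
    while True:
        cols = math.ceil(w / tile_size)
        rows = math.ceil(h / tile_size)
        levels.append((w, h, cols, rows))
        if cols == 1 and rows == 1:
            break
        # Halve each dimension, rounding up, never going below one pixel.
        w = max(1, math.ceil(w / 2))
        h = max(1, math.ceil(h / 2))
    return levels


def total_tiles(width, height, tile_size=256):
    """Count every tile across all levels of the pyramid."""
    return sum(cols * rows for _, _, cols, rows in tile_pyramid(width, height, tile_size))


if __name__ == "__main__":
    for level in tile_pyramid(1024, 512):
        print(level)
    print("tiles:", total_tiles(1024, 512))
```

Under these assumptions, the viewer only ever fetches the handful of tiles that cover the visible area at the current zoom level, which is why pre-computing them pays off.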
Serving Deep Zoom images can be rather CPU-intensive for the application server. This processing burden is greatly reduced when the image has already been pre-processed into tiles. The image processing pipeline automatically performs Deep Zoom pre-processing on new collections and on updates to existing collections. But that leaves hundreds of millions of images that have not been pre-processed because they were published before the release of our Deep Zoom technology.
This is where the MVP Deep Zoom modules, running on multiple agents across multiple server nodes, recently came into play. Even with multiple iFarm server nodes and many agents running 24/7, pre-processing the images in our 500 most actively used collections for Deep Zoom took several months. Without the advantages of DPC in our iFarm system, this project could have taken years. Eventually all of our collection titles will be pre-processed for Deep Zoom using iFarm.
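The “many agents” idea can be sketched on a single machine with a worker pool. This is not iFarm code: `preprocess` is a hypothetical placeholder for the real tiling work, `run_farm` is an invented name, and a local thread pool stands in for agents spread across server nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(image_id):
    """Hypothetical stand-in for tiling one image; returns (id, status)."""
    return image_id, "tiled"


def run_farm(image_ids, agents=4):
    """Spread the images across a pool of workers and collect the results.

    Each worker plays the role of one task agent. Results come back in
    input order regardless of which worker finished first.
    """
    with ThreadPoolExecutor(max_workers=agents) as pool:
        return dict(pool.map(preprocess, image_ids))


if __name__ == "__main__":
    print(run_farm(["img-001", "img-002", "img-003"], agents=2))
```

For genuinely CPU-bound work, Python threads would not run in parallel, so a real setup would use separate processes or, as in our case, separate machines. The shape of the problem is the same either way: independent images, handed out to whichever worker is free.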
If Heywood were a TechRoots blogger today, he would write, “Many CPUs make light work.” At Ancestry.com we are always looking for ways to achieve more in less time using the power of distributed parallel computing.
About Tyler Jensen
Tyler Jensen is a senior software engineer in R&D at Ancestry.com. He has worked in the software industry since 1992. He loves to solve difficult technical challenges. When he's not working or writing or reading, he enjoys spending time with his wife and four children.