Tech Roots » Distributed Computing
Ancestry.com Tech Roots Blogs (http://blogs.ancestry.com/techroots)

Monitoring progress of SOA HPC jobs programmatically
Fri, 17 Oct 2014
http://blogs.ancestry.com/techroots/monitoring-progress-of-soa-hpc-jobs-programmatically/

Here at Ancestry.com, we currently use Microsoft’s High Performance Computing (HPC) cluster for a variety of things. My team uses an HPC cluster for several distinct job types, and interestingly enough, no two of them communicate with HPC in exactly the same way. Two of our use cases follow the Service Oriented Architecture (SOA) model, but even those communicate differently.

Recently, I was working on a problem where I wanted our program to know exactly how many tasks in a job had completed (not just the percentage of progress), similar to what can be seen in HPC Job Manager. The code for these HPC jobs uses the BrokerClient to send tasks. With the BrokerClient you can “fire and forget,” which is what this solution does. I should note that the BrokerClient can retrieve results after the job is finished, but that wasn’t my use case. I thought there should be a simple way to ask HPC how many tasks had completed. It turns out that this is not as easy as you might expect when using the SOA model, and I couldn’t find any documentation on how to do it. I found a solution that worked for me, and I thought I’d share it.

[Image: HPC Session Request Breakdown, as shown in HPC Job Manager]

With a BrokerClient, your link back to the HPC job is the Session object used to create the BrokerClient. From a Scheduler, you can get the ISchedulerJob that corresponds to the Session by matching ISchedulerJob.Id to Session.Id. My first thought was to use ISchedulerJob.GetTaskList() to retrieve the tasks and look at the task details. It turns out that for SOA jobs, tasks do not correspond to requests, and the tasks don’t have any methods on them to indicate how many requests they’ve fulfilled, either.

My solution came from looking at the results of the ISchedulerJob.GetCustomProperties() method. I was surprised to find the answer there, since the MSDN documentation describes these as “application-defined properties.”

I found four name-value pairs that may be useful for knowing the state of requests in a SOA job, with the following keys:

  • “HPC_Calculating”
  • “HPC_Calculated”
  • “HPC_Faulted”
  • “HPC_PurgedProcessed”

I should note that some of these properties don’t exist when the job is brand new, with no requests sent to it yet.  Also, I was disappointed to find no key corresponding to the “incoming” requests, since some applications might not be able to calculate that themselves.

With that information, I was able to write code to monitor the SOA jobs.
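A minimal sketch of that monitoring code is below. It assumes the Microsoft HPC Pack scheduler API (Microsoft.Hpc.Scheduler); the head node parameter, the helper class, and the assumption that these property values parse as integers are mine for illustration, not anything the product documents.

    using System.Collections.Generic;
    using Microsoft.Hpc.Scheduler;

    static class SoaJobMonitor
    {
        // Reads the request-count custom properties off the job behind a SOA session.
        // headNode is a placeholder; sessionId is the Session.Id used to create the BrokerClient.
        public static Dictionary<string, int> GetRequestCounts(string headNode, int sessionId)
        {
            var counts = new Dictionary<string, int>();

            IScheduler scheduler = new Scheduler();
            scheduler.Connect(headNode);
            ISchedulerJob job = scheduler.OpenJob(sessionId);

            foreach (INameValue prop in job.GetCustomProperties())
            {
                // Keys of interest: HPC_Calculating, HPC_Calculated, HPC_Faulted, HPC_PurgedProcessed.
                // Some of them may be missing while the job is still brand new.
                if (prop.Name.StartsWith("HPC_") && int.TryParse(prop.Value, out int value))
                    counts[prop.Name] = value;
            }

            return counts;
        }
    }

Polling something like this on a timer yields a running count of in-progress, completed, and faulted requests, comparable to the request breakdown that HPC Job Manager shows.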

I should also note that our other SOA HPC use case monitors the state of requests directly and can provide more detailed real-time information. We do this by creating our own ChannelFactory and channels. With that approach, the requests are not “fire and forget” – we get a result back from each request individually as it completes, so we always know how many requests are outstanding and how many have finished. If we wanted to, we could combine this with the BrokerClient solution above to find out how many requests are in the “calculating” state.
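The binding and endpoint details for talking to the broker directly are specific to our environment and HPC SDK version, so the sketch below only shows the generic WCF ChannelFactory pattern, with a hypothetical service contract and a simple completed-request counter; it illustrates the idea rather than our production client.

    using System;
    using System.ServiceModel;
    using System.Threading;

    // Hypothetical service contract; the real one matches the SOA service deployed to the cluster.
    [ServiceContract]
    public interface IWorkService
    {
        [OperationContract]
        string Process(string request);
    }

    public class WorkClient
    {
        private int _completed;
        public int Completed { get { return _completed; } }

        // The endpoint is a placeholder; for an HPC SOA broker it comes from the session.
        public void SendAll(EndpointAddress brokerEndpoint, string[] requests)
        {
            var factory = new ChannelFactory<IWorkService>(new NetTcpBinding(), brokerEndpoint);
            IWorkService channel = factory.CreateChannel();

            foreach (string request in requests)
            {
                // Each call returns its own result, so we always know how many requests
                // are done and how many are still outstanding.
                string result = channel.Process(request);
                int done = Interlocked.Increment(ref _completed);
                Console.WriteLine("{0}/{1} completed: {2}", done, requests.Length, result);
            }

            ((IClientChannel)channel).Close();
            factory.Close();
        }
    }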

One last disclaimer: these “custom properties” are not documented, even though they are publicly exposed, and Microsoft could change them. If they ever do, I hope they would consider it a breaking change and document it, but there are no guarantees of that, so use discretion when considering this solution.

On Track to Data-Driven
Wed, 25 Dec 2013
http://blogs.ancestry.com/techroots/on-track-to-data-driven/

At Ancestry.com, we are becoming more and more aware of the value of the data our website generates every single day. Many customers come to the website to discover, preserve, and share their family history. They come from different parts of the world, looking for information that helps them tell the story of their ancestors’ lives and learn more about themselves. We certainly acknowledge, and to some extent “celebrate, the heterogeneity” of our customers, a marketing principle Professor Fader of the Wharton School described in his Coursera course, where he also stressed that relying on data is key. But where do we start?

Most companies begin their big data journey by capturing customer data, and soon run into the problem of how to store and process such a humongous data set. Smart people at Google had great ideas on how to solve this problem with MapReduce and the Google File System, and people at Yahoo! and other companies soon followed with an open source implementation called Hadoop. We are using Hadoop at Ancestry.com, and I am glad we were not the ones who first ran into a big data problem with no existing solution, though we have had to develop our own innovative methods to scale with a growing business and growing content.

We are getting more and more familiar with Hadoop from the success we’ve had with the AncestryDNA project, which I covered in a previous blog post. Now we have a place to store and process the data, which is great, but we have to ask ourselves: did we capture all of the necessary data? In our case, the answer is yes and no. We started by looking at the data we already had, and much of it was webserver logging, which only gives us limited information about page visits. Thanks to the Ancestry Framework team, we are now able to capture logging events through aspects (AOP) and to stitch together the information from the multiple stacks that collectively serve a single end user web request.
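Our Framework team’s implementation is internal, but to illustrate the AOP idea, here is a rough sketch using Castle DynamicProxy (an assumption on my part, not necessarily the library our framework uses) that logs each call with a correlation id so events from the different stacks handling one web request can be stitched together later.

    using System;
    using Castle.DynamicProxy;

    // Logs method entry and exit with a correlation id so that events emitted by
    // multiple services handling the same user request can be joined downstream.
    public class LoggingInterceptor : IInterceptor
    {
        public void Intercept(IInvocation invocation)
        {
            // In a real system the correlation id would flow in from the incoming web request
            // rather than being generated here.
            string correlationId = Guid.NewGuid().ToString("N");

            Console.WriteLine("{0} start {1}.{2}", correlationId,
                invocation.TargetType.Name, invocation.Method.Name);

            invocation.Proceed();   // run the intercepted method

            Console.WriteLine("{0} end   {1}.{2}", correlationId,
                invocation.TargetType.Name, invocation.Method.Name);
        }
    }

    public static class LoggingProxy
    {
        private static readonly ProxyGenerator Generator = new ProxyGenerator();

        // Wraps any interface-based component with the logging aspect.
        public static T Wrap<T>(T target) where T : class
        {
            return Generator.CreateInterfaceProxyWithTarget(target, new LoggingInterceptor());
        }
    }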

How would we get the newly instrumented logging data into Hadoop? Let’s take a look at how Hadoop was adopted in its infancy. Historically, people would drop data into Hadoop and run long-running MapReduce programs to process it. Very soon, they realized they needed something faster than a batch processing job, and many stream processing frameworks were invented as a result.

At Ancestry.com we’re working towards real-time processing in Hadoop as well. After looking at a few frameworks, we chose Apache Kafka to stream the data into Hadoop. (Again, thanks to those who ran into these difficult problems first and created such elegantly designed systems.) With help from LinkedIn’s Kafka and Hadoop folks, along with the open source community, we were able to go live with Kafka on a few major stacks, collecting data just before Christmas this year. We then plan a full-scale rollout to the rest of the stacks. This meets our need to handle massive amounts of data by pumping it into Hadoop, and it better positions us for real-time stream processing, which might come next.
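As a rough illustration of the producer side, here is what publishing a logging event to a Kafka topic can look like. This sketch uses the current Confluent.Kafka .NET client with made-up broker names, a made-up topic, and a made-up payload; it is not the client or configuration we actually ran.

    using System;
    using Confluent.Kafka;

    class LogEventPublisher
    {
        static void Main()
        {
            // Broker list and topic name are placeholders.
            var config = new ProducerConfig { BootstrapServers = "kafka01:9092,kafka02:9092" };

            using (var producer = new ProducerBuilder<Null, string>(config).Build())
            {
                string logEvent = "{\"stack\":\"search\",\"event\":\"page_view\",\"ts\":\"2013-12-20T10:15:00Z\"}";

                // Publish the event; a separate consumer pipeline lands the topic in Hadoop.
                producer.Produce("web-logging-events",
                                 new Message<Null, string> { Value = logEvent });

                // Make sure buffered messages are delivered before the process exits.
                producer.Flush(TimeSpan.FromSeconds(10));
            }
        }
    }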

We are hoping all this will help our Analytics group and Data Science team understand user data better in order to improve our customers’ experience. After a quick holiday break, the data engineering team will be back at work, making sure 2014 is on track to become a data-driven year.

Happy Holidays!

Throttling Image Processing
Fri, 21 Jun 2013
http://blogs.ancestry.com/techroots/throttling-image-processing/

Ancestry.com, like any other site with millions of subscribers, experiences predictable load patterns throughout the day. To maximize site performance and customer satisfaction, we make every effort to schedule maintenance during off-peak intervals.

Content processing, on the other hand, especially across our repository of hundreds of millions of images, is a constant, ongoing effort, and in some cases it must be done on live content being served up to our customers. One example occurs when we roll an improved set of images for a given collection, such as the 1921 Census of Canada, to the live site. Many of these images may have different dimensions than the originally published images, so to be sure we get it right, we double-check every image in the collection.

Until now, this work was done with a desktop tool that was effective but could take days to complete on very large collections. To speed this up, the Enterprise Media Team’s distributed computing initiative created a new service that uses a lightweight, open source distributed computing framework called DuoVia.MpiVisor, a project led by this author outside of his regular Ancestry.com responsibilities, to distribute the work across five servers with a total of 64 logical processors.

Distributing the work across 64 logical processors was enormously successful, verifying the dimensions of up to 50,000 images every minute. The challenge was that if content management could use this very powerful tool at any time of day, it could well affect the performance of our live site, something we wanted very much to avoid.

To throttle the new image dimension populating (IDP) service, we defined three time windows covering the high, medium, and low traffic periods of the day. During high traffic periods, only one third of the processing agents are given work; during medium traffic periods, only one half of the available agents are used; and during off-peak periods, all available agents are utilized.
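The throttle itself is straightforward. Here is a sketch of the idea; the hour boundaries below are illustrative, not our actual traffic schedule.

    using System;

    static class AgentThrottle
    {
        // Returns how many processing agents may be given work right now.
        // The hour boundaries are examples; the real schedule is tuned to our traffic patterns.
        public static int AllowedAgents(int totalAgents, DateTime now)
        {
            int hour = now.Hour;

            bool highTraffic   = hour >= 9 && hour < 17;                          // peak site usage
            bool mediumTraffic = (hour >= 7 && hour < 9) || (hour >= 17 && hour < 22);

            if (highTraffic)   return Math.Max(1, totalAgents / 3);   // one third of the agents
            if (mediumTraffic) return Math.Max(1, totalAgents / 2);   // one half of the agents
            return totalAgents;                                       // off-peak: use them all
        }
    }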

In the weeks since the IDP service launched, it has processed over 130 million images in just over 6,700 run-time minutes. That is a throttled average of about 19,000 images processed per minute of processing time, far below its current max potential of 50,000 per minute.

By throttling the work, the IDP service remains responsive during peak traffic times without impacting the customer experience, allowing content teams to continue working to deliver the best images as soon as humanly possible to our customers.

Distributed Parallel Computing at Ancestry.com
Wed, 24 Apr 2013
http://blogs.ancestry.com/techroots/distributed-parallel-computing-at-ancestry-com/

About 450 years ago John Heywood wrote, “many hands make light work.” The same can be said of image and data processing. Distributed parallel computing (DPC) makes it possible for us to do the work described by Michael Murdock in his series on the image processing pipeline. If you haven’t already, take a moment to read his excellent posts.

At Ancestry.com we use a DPC system developed in-house that we call “iFarm.” We also use more recognizable DPC systems such as Hadoop for some things, but our primary image processing pipeline, described by Michael, runs on the iFarm.

The iFarm’s Client Controller allows us to monitor and control the servers and task agents in the “farm” of servers processing tasks. It also allows us to roll new task code to each of the client nodes when a change is made to the code.

[Image: The iFarm Client Controller, which allows us to manage servers and agents remotely.]

In addition to the image processing pipeline, and as the need arises, the Enterprise Media Team (EMT) creates and runs a series of image and data correction modules on already published images and data. We call this series of modules the Media Validation Processor (MVP). Probably the most significant MVP module is our Deep Zoom pre-processing module.

About 18 months ago, Ancestry.com introduced its Deep Zoom image viewing technology. This allows our users to zoom in and out on hard-to-read historical records or images in a record collection, such as the 1940 Census, with very little if any delay in loading the image. To achieve the best performance, this technology requires that the original image be specially processed into what we call “tiles.”
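To give a feel for what “specially processed into tiles” means, here is a small sketch of the Deep Zoom pyramid arithmetic: each level halves the image dimensions, and every level is cut into fixed-size tiles. The 256-pixel tile size is just a common default, and the small tile overlap Deep Zoom uses is ignored here.

    using System;

    static class DeepZoomMath
    {
        // Prints how many tiles each pyramid level needs for a given source image.
        public static void PrintPyramid(int width, int height, int tileSize = 256)
        {
            // The top level holds the full-resolution image; each lower level halves it.
            int maxLevel = (int)Math.Ceiling(Math.Log(Math.Max(width, height), 2));

            for (int level = maxLevel; level >= 0; level--)
            {
                double scale = Math.Pow(2, maxLevel - level);
                int w = (int)Math.Ceiling(width / scale);
                int h = (int)Math.Ceiling(height / scale);

                int cols = (int)Math.Ceiling((double)w / tileSize);
                int rows = (int)Math.Ceiling((double)h / tileSize);

                Console.WriteLine("level {0}: {1}x{2} px, {3} tile(s)", level, w, h, cols * rows);
            }
        }
    }

Pre-processing simply generates and stores all of these tiles ahead of time, so the application server can serve them directly instead of cutting them on the fly.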

Viewing Deep Zoom processed images can be rather CPU intensive for the application server. This processing burden is greatly reduced when the image has been pre-processed into tiles. The image processing pipeline automatically performs Deep Zoom pre-processing on new collections and on updates to existing collections, but that leaves hundreds of millions of images that have not been pre-processed because they were published before the release of our Deep Zoom technology.

This is where the MVP Deep Zoom modules running on multiple agents across multiple server nodes recently came into play. Even with multiple iFarm server nodes and many agents running 24/7, the pre-processing of images for Deep Zoom in our top 500 most actively used collections required several months to complete. If not for the advantages of DPC in our iFarm system, this project could have taken years to complete. Eventually all of our collection titles will be pre-processed for Deep Zoom using iFarm.

If Heywood were a TechRoots blogger today, he would write, “Many CPUs make light work.” At Ancestry.com we are always looking for ways to achieve more in less time using the power of distributed parallel computing.
