Tech Roots: Ancestry.com Tech Roots Blog (http://blogs.ancestry.com/techroots)

The DNA matching research and development life cycle
By Julie Granka, August 19, 2014

Research into matching patterns of over a half-million AncestryDNA members translates into new DNA matching discoveries 

Among over 500,000 AncestryDNA customers, more than 35 million 4th cousin relationships have been identified – a number that continues to grow exponentially.  While that means millions of opportunities for personal discoveries by AncestryDNA members, it also means a lot of data that the AncestryDNA science team can put back into research and development for DNA matching.

At the Institute for Genetic Genealogy Annual Conference in Washington, D.C. this past weekend, I spoke about some of the AncestryDNA science team’s latest exciting discoveries – made by carefully studying patterns of DNA matches in a 500,000-member database.

 

Graph showing growth in the number of 4th cousin matches between pairs of AncestryDNA customers over time

DNA matching means identifying pairs of individuals whose genetics suggest that they are related through a recent common ancestor. But DNA matching is an evolving science.  By analyzing the results from our current method for DNA matching, we have learned how we might be able to improve upon it for the future.

 

Life cycle of AncestryDNA matching research and development

The science team targeted our research of the DNA matching data so that we could obtain insight into two specific steps of the DNA matching procedure.

Remember that a person gets half of their DNA from each of their parents – one full copy from their mother and one from their father.  The problem is that your genetic data doesn’t tell us which parts of your DNA you inherited from the same parent.  The first step of DNA matching is called phasing, and determines the strings of DNA letters that a person inherited from each of their parents.  In other words, phasing distinguishes the two separate copies of a person’s genome.

 

Observed genetic data only reveals the pairs of letters that a person has at a particular genetic marker. Phasing determines which strings of letters of DNA were inherited as a unit from each of their parents.

If we had DNA from everyone’s parents, phasing someone’s DNA would be easy.  But unfortunately, we don’t.  So instead, phasing someone’s DNA is often based on a “reference” dataset of people in the world who are already phased.  Typically, those reference sets are rather small (around one thousand people).
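To make the trio case concrete, here is a toy Python sketch (not AncestryDNA's actual algorithm, and the allele data is invented) of how a child's genotypes can be phased when both parents' genotypes are available. The ambiguous branch is exactly where reference-based statistical phasing has to take over.

```python
# Toy illustration (not AncestryDNA's algorithm): phasing a child's genotypes
# when both parents' genotypes are available (trio phasing).
# Genotypes are unordered pairs of alleles at each marker, e.g. ("A", "G").

def phase_with_parents(child, mother, father):
    """Return (maternal_haplotype, paternal_haplotype), with None where unresolvable."""
    maternal, paternal = [], []
    for c, m, f in zip(child, mother, father):
        a, b = c
        if a == b:  # homozygous sites are trivially phased
            maternal.append(a); paternal.append(a)
            continue
        # An allele can be assigned to a parent if only one parent could have passed it on.
        if a in m and b not in m:
            maternal.append(a); paternal.append(b)
        elif b in m and a not in m:
            maternal.append(b); paternal.append(a)
        elif a in f and b not in f:
            paternal.append(a); maternal.append(b)
        elif b in f and a not in f:
            paternal.append(b); maternal.append(a)
        else:
            # Ambiguous (e.g. everyone is heterozygous): a statistical phasing
            # method with a reference panel is needed here.
            maternal.append(None); paternal.append(None)
    return maternal, paternal

child  = [("A", "G"), ("C", "C"), ("T", "G")]
mother = [("A", "A"), ("C", "T"), ("G", "G")]
father = [("G", "G"), ("C", "C"), ("T", "T")]
print(phase_with_parents(child, mother, father))
# (['A', 'C', 'G'], ['G', 'C', 'T'])
```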

Studies of customer data led us to find that we could incorporate data from hundreds of thousands of existing customers into our reference dataset.  The result?  Phasing that is more accurate, and faster.  Applying this new approach would mean a better setup for the next steps of DNA matching.

The second step in DNA matching is to look for pieces of DNA that are identical between individuals.  For genealogy research, we’re interested in DNA that’s identical because two people are related from a recent common ancestor.  This is called DNA that is identical by descent, or IBD.  IBD DNA is what leads to meaningful genealogical discoveries: allowing members to connect with cousins, find new ancestors, and collaborate on research.

But there are other reasons why two people’s DNA could be identical. After all, the genomes of any two humans are 99.9% identical. Pieces of DNA could be identical between two people because they are both human, because they are of the same ethnicity, or because they share some more ancient common history.  We call these pieces of DNA identical by state (IBS), because the DNA could be identical for a reason other than a recent common ancestor.
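As a rough illustration of the matching step itself, the sketch below scans two phased haplotypes for long runs of identical markers. Real IBD detection works on genetic map distances and tolerates genotyping error, so treat this purely as a toy, with invented marker data and an invented length threshold.

```python
# Toy sketch: find long identical runs between two phased haplotypes.
# Real IBD detection uses genetic distances (centimorgans) and handles
# genotyping error; this only illustrates the basic idea.

def identical_segments(hap1, hap2, min_markers=100):
    """Yield (start, end) index ranges where hap1 and hap2 agree for at least min_markers markers."""
    start = None
    for i, (a, b) in enumerate(zip(hap1, hap2)):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_markers:
                yield (start, i)
            start = None
    if start is not None and len(hap1) - start >= min_markers:
        yield (start, len(hap1))

# Example with a tiny threshold so the toy data produces output.
h1 = "AACGTTACGA"
h2 = "AACGTTTCGA"
print(list(identical_segments(h1, h2, min_markers=4)))  # [(0, 6)]
```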

We sought to understand the causes of identical pieces of DNA between more than half a million AncestryDNA members.  Our in-depth study of these matches revealed that in certain places in the genome, thousands of people were estimated to have DNA identical to one another.

What we found is that thousands of people all having matching DNA isn’t a signal of all of them being closely related to one another.  Instead, it’s likely a hallmark of a more ancient shared history between those thousands of individuals – or IBS.

 

Finding places in the genome where thousands of people all have identical DNA is likely a hallmark of IBS, but not IBD.

In other words, our analysis revealed that in a few cases where we thought people’s DNA was identical by descent, it was actually identical by state.  These striking matching patterns were only apparent after viewing the massive amount of matching data that we did.

So while the data suggested that our algorithms had room for improvement, that same data gave us the solution.  After exploring a large number of potential fixes and alternative algorithms, we discovered that the best way to address the problem was to use the observed DNA matches to determine which were meaningful for genealogy (IBD) – and distinguish them from those due to more ancient shared history.  In other words, the matching data itself has the power to help us tease apart the matches that we want to keep from those that we want to throw away.
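A simplified sketch of that idea is to count how many matches pile up in each region of the genome and discount segments that fall entirely inside unusually busy windows. The real algorithm is more sophisticated, and the window size and threshold below are invented for illustration.

```python
from collections import Counter

# Toy sketch of using the matches themselves to flag likely-IBS regions:
# windows of the genome where an implausible number of matches pile up are
# treated as ancient shared history rather than recent common ancestry.
WINDOW = 1_000_000        # 1 Mb windows (invented for illustration)
PILEUP_THRESHOLD = 5_000  # matches per window considered suspicious (invented)

def window_counts(segments):
    """segments: iterable of (chrom, start_bp, end_bp) for every detected match."""
    counts = Counter()
    for chrom, start, end in segments:
        for w in range(start // WINDOW, end // WINDOW + 1):
            counts[(chrom, w)] += 1
    return counts

def likely_ibd(segment, counts):
    """Keep a segment only if it touches at least one non-pileup window."""
    chrom, start, end = segment
    windows = [(chrom, w) for w in range(start // WINDOW, end // WINDOW + 1)]
    return any(counts[w] < PILEUP_THRESHOLD for w in windows)
```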

The AncestryDNA science team’s efforts – poring through mounds and mounds of DNA matches – have paid off.  From preliminary testing, it appears that these latest discoveries relating to both steps of DNA matching may lead to dramatic DNA matching improvements. In the future, this may translate to a higher-quality list of matches for each AncestryDNA member: fewer false matches, and a few new matches too.

In addition to the hard work of the AncestryDNA science team, the huge amount of DNA matching data from over a half-million AncestryDNA members is what has enabled these new discoveries.  Carefully studying the results from our existing matching algorithms has now allowed us to complete the research and development “life cycle” of DNA matching: translating real data into future advancements in the AncestryDNA experience.

Core Web Accessibility Guidelines
By Jason Boyer, August 13, 2014

How do you ensure accessibility on a website that is worked on by several hundred web developers?

That is the question we are continually asking ourselves and have made great strides towards answering. The approach we took was to document our core guidelines and deliver presentations and trainings to everyone involved. This included not only our small team of dedicated front-end web developers but also the dozens of back-end developer teams that also work within the front-end. This article will be the first in a series going in-depth on a variety of web accessibility practices.

The following core guidelines, though they encompass hundreds of specific rules, have helped focus our accessibility efforts.

A website should:

  • Be Built on Semantic HTML
  • Be Keyboard Accessible
  • Utilize HTML5 Landmarks and ARIA Attributes
  • Provide Sufficient Contrast

Our internal documentation goes into detail as to why these guidelines are important and how to fulfill each requirement. For example, semantic HTML is important because it allows screen readers and browsers to properly interpret your page and helps with keyboard accessibility. Landmarks are important because they allow users with screen readers to navigate over blocks of content. Contrast is important because people need to be able to see your content!
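For the contrast guideline in particular, the check is mechanical enough to script. The sketch below computes the WCAG 2.0 contrast ratio between two colors; the luminance formula comes from the WCAG spec, and 4.5:1 is the AA requirement for normal-size text.

```python
# WCAG 2.0 contrast ratio between two sRGB colors given as "#RRGGBB" strings.

def relative_luminance(hex_color):
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

def contrast_ratio(fg, bg):
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

print(round(contrast_ratio("#777777", "#FFFFFF"), 2))  # ~4.48 -- just misses AA (4.5:1)
print(round(contrast_ratio("#595959", "#FFFFFF"), 2))  # ~7.0  -- passes AA for normal text
```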

Do our current pages meet all of these requirements? Nope. That’s why we’ve documented them so that we can provide structure to this effort and have measurable levels of success.

We have been learning a lot about accessibility during the past few months. The breadth of this topic is amazing. Lots of good people in the web community have made tremendous efforts in helping others learn. W3C’s Web Accessibility Initiative documentation and WebAIM’s explanations are definitely worth the time to study.

In my following posts, I will outline many of the rules with practical examples for each of our core guidelines. Some of the items that I’ll describe are:

  • Benefits of Semantic HTML
    Should all form elements be wrapped in a form element? Do I need to use the role attribute on HTML5 semantic elements? How do you decide between these: <a> elements vs <button> elements, input[type="number"] vs input[type="text"], <img> elements vs CSS backgrounds, and more.
  • Keyboard Accessibility 101
    Should I ever use [tabindex]? Why/when? How should you handle showing hidden or dynamically loaded content? Should the focus be moved to popup menus?
  • HTML5 Landmarks
    Why I wish landmarks were available outside of screen reader software. Which landmarks should I use? How do I properly label them?

Web accessibility is the reason semantic HTML exists. Take the time to learn how to make your HTML, CSS, and JavaScript accessible. If you’re going to take the time to create an HTML page, you may as well do it correctly.

Migrating From TFS to Git-based Repositories (II)
By Seng Lin Shee, August 8, 2014

Previously, I wrote about why Git-based repositories have become popular and why TFS users ought to migrate to Git.

In this article, I would like to take a stab at providing a quick guide for longtime TFS / Visual Studio users to quickly ramp up on the knowledge required to work on Git-based repositories. This article will try to present Git usage based on the perspective of a TFS user. Of course, there may be some Git-only concepts, but I will try my best to lower the learning curve for the reader.

I do not intend to thoroughly explore the basic Git concepts. There are very good tutorials out there with amazing visualizations (e.g. see Git tutorial). However, this is more like a no-frills quick guide for no-messing-around people to quickly get something done in a Git-based world (Am I hitting a nerve yet? :P ).

Visual Studio has done a good job abstracting the complex commands behind the scenes, though I would highly recommend going through the nitty-gritty details of each Git command if you are invested in using Git for the long term.

For this tutorial, I only require that you have one of the following installed:

  1. Visual Studio 2013
  2. Visual Studio 2012 with Visual Studio Tools for Git

Remapping your TFS-trained brain to a Git-based one

Let’s compare the different approaches between TFS and Git.

TFS Terminology

Tfs flow

  1. You will start your work on a TFS solution by synchronizing the repository to your local folder.
  2. Every time you modify a file, you will check out that file.
  3. Checking in a file commits the change to the central repository; hence, it requires all contributors who are working on that file to ensure that conflicts have been resolved.
  4. The one thing to note is that TFS keeps track of the entire content of files, rather than the changes made to the contents.
  5. Additionally, versioning and branching requires developers to obtain special rights to the TFS project repository.

Git Terminology

Git flow

If you are part of the core contributor group (left part of diagram):

  1. Git, on the other hand, introduces the concept of a local repository. Each local repository represents a standalone repository that allows the contributor to continue working, even when there is no network connection.
  2. The contributor is able to commit work to the local repository and create branches based on the last snapshot taken from the remote repository.
  3. When the contributor is ready to push the changes to the main repository, a sync is performed (pull, followed by a push). If conflicts do occur, a fetch and a merge are performed which requires the contributor to resolve conflicts.
  4. Following conflict resolution, a commit is performed against the local repository and then finally a sync back to the remote repository.
  5. The image above excludes the branching concept. You can read more about it here.

If you are an interested third party who wants to contribute (right part of diagram):

  1. The selling point of Git is the ability for external users (who have read-only access) to contribute (with control from registered contributors).
  2. Anyone who has read-only access is able to set up a Personal Project within the Git service and fork the repository.
  3. Within this project, the external contributor has full access to modify any files. This Personal Project also has a remote repository and local repository component. Once ready, the helpful contributor may launch a pull request against the contributors of the main Team Project (see above).
  4. With Git, unregistered contributors are able to get involved and contribute to the main Team Project without risking breaking the main project.
  5. There can be as many personal projects forking from any repositories as needed.
  6. *It should be noted that any projects (be it Personal or Team Projects) can step up to be the main working project in the event that the other projects disappear/lose popularity. Welcome to the wild world of open source development.

Guide to Your Very First Git Project

Outlined below are the steps you will take to make your very first check-in to a Git repository. This walkthrough assumes you are new to Git but have been using Visual Studio + TFS for a period of time.

Start from the very top and make your way to the bottom by trying out different approaches based on your situation and scenario. These approaches are the fork and pull (blue) and the shared repository (green) models. I intentionally present the feature branching model (yellow) (which I am not elaborating in this article) to show the similarities. You can read about these collaborative development models here.

Feel free to click on any particular step to learn more about it in detail.


Git Guide

Migrating from TFS to Git-based Repository

  1. Create a new folder for your repository. TFS and Git temp files do not play nicely with each other. The following would initialize the folder with the relevant files for Visual Studio projects and solutions (namely .gitignore and .gitattributes files).
    Git step 1
  2. Copy your TFS project over to this folder.
  3. I would advise running the following command to remove “READ ONLY” flags for all files in the folder (this is automatically set by TFS when files are not checked out).
    >attrib -R /S /D *
  4. Click on Changes.
    Git step 2
  5. You will notice the generated files (.gitattributes & .gitignore). For now, you want to add all of the files that you have just copied in. Click Add All under the Untracked Files drop-down menu.
    Git step 3
  6. Then, click on Unsynced Commits.
  7. Enter the URL of the newly created remote repository. This URL is obtained from the Git service when creating a new project repository.
    Git step 4
  8. You will be greeted with the following prompt:
    Git step 5
  9. You will then see Visual Studio solution files listed in the view. If you do not have solution files, then unfortunately, you will have to rely on tools such as the Git command line or other visualization tools.

Back to index

Clone Repository onto Your Local Machine

  1. Within Visual Studio, in the Team Explorer bar,
    Git step 6

    1. Click the Connect to Team Projects.
    2. Clone a repository by entering the URL of the remote repository.
    3. Provide a path for your local repository files.
  2. You will then see Visual Studio solution files listed in the view. If you do not have solution files, then unfortunately, you will have to rely on tools such as the Git command line or other visualization tools.

Back to index

Commit the Changes

  1.  All basic operations (Create, Remove, Update, Delete, Rename, Move, Undelete, Undo, View History) are identical to the ones used with TFS.
  2. Please note that committing the changes DOES NOT affect the remote repository. This only saves the changes to your local repository.
  3. After committing, you will see the following notification:
    Git step 7

Back to index

Pushing to the Remote Repository

  1. Committing does not store the changes to the remote repository. You will need to push the change.
  2. Sync refers to the sequential steps of pulling (to ensure both local and remote repositories have the same base changes) and pushing from/to the remote repository. A rough command-line equivalent is sketched after this list.
    Git step 8
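Under the covers, the Sync button is roughly the pull-then-push sequence below. This is a hedged approximation (Visual Studio's exact behavior may differ between versions), wrapped in a small Python script only to keep the examples in this series consistent.

```python
# Rough command-line equivalent of Visual Studio's Sync (pull, then push).
# Visual Studio's exact behavior may differ; this is only an approximation.
import subprocess

def run(*args):
    print("$ git", " ".join(args))
    subprocess.run(["git", *args], check=True)

def sync():
    run("pull")   # bring in the remote's base changes (fetch + merge)
    run("push")   # publish local commits to the remote repository

if __name__ == "__main__":
    sync()
```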

Back to index

Pulling from the Remote Repository and Conflict Resolution

  1. If no other contributors have added a change that conflicts with your change, you are good to go. Otherwise, the following message will appear:
    Git step 9
  2. Click on Resolve the Conflict link to bring up the Resolve Conflict page. This is similar to conflict resolution in TFS. Click on Merge to bring up the Merge Window.
    Git step 10
  3. Once you are done merging, hit the Accept Merge button.
    Git step 11
  4. Merging creates new changes on top of your existing changes to match the base change in the remote repository. Click Commit Merge, followed by Commit in order to commit this change to your local repository.
    Git step 12
  5. Now, you can finally click Sync.
    Git step 13
  6. If you see the following message, you have completed the “check-in” to your remote repository.
    Git step 14

Back to index

Forking a Repository

  1. Forking is needed when developers have restricted (read-only) access to a repository.  See Workflow Strategies in Stash for more information.
  2. One thing to note is that forking is essentially server-side cloning. You can fork any repository provided you have read access. This allows anyone to contribute and share changes with the community.
  3. There are two ways to ensure your fork stays current within the remote repository:
    1. Git services such as Stash have features that automatically sync the forked repository with the original repository.
    2. Manually syncing with the remote server
  4. You are probably wondering what the difference is between branching and forking. Here is a good answer to that question. One simple answer is that you have to be a registered collaborator in order to make a branch or pull/push an existing branch.
  5. Each Git service has its own way of creating a fork. The feature will be available when you have selected the right repository project and a branch to fork. Here are the references for GitHub and Stash respectively.
  6. Once you have forked a repository, you will have your own personal URL for the newly created/cloned repository.

Back to index

Submit a Pull Request

  1. Pull requests are useful to notify project maintainers about changes in your fork, which you want integrated into the main branch.
  2. A pull request initiates the process to merge each change from the contributors’ branch to the main branch after approval.
  3. Depending on the Git service, a pull request provides a means to conduct code reviews amongst stakeholders.
  4. Most Git services will be able to trigger a pull request within the branch view of the repository. Please read these sites for specific instructions for BitBucket, GitHub and Stash.
  5. A pull request can only be approved if there are no conflicts with the targeted branch. Otherwise, the repository will provide specific instructions to merge changes from the main repository back to your working repository.

Back to index

Summary

Git is a newer approach to version control and has been very successful with open source projects as well as with development teams who adopt the open source methodology. There are benefits to both Git and TFS repositories. Some projects are suitable candidates for adopting Git and others are not; the deciding factors include team size, team dynamics, project cadence, and requirements.

What are your thoughts about when Git should be the preferred version control for a project? What is the best approach for lowering the learning curve for long-term TFS users? How was your (or your team’s) experience in migrating towards full Git adoption? Did it work out? What Git tools do you use to improve Git-related tasks? Please share your experience in the comment section below.

 

Maintaining Balance by Using Feedback Loops in Software
By Chad Groneman, July 29, 2014

Feedback is an important part of daily life.  Used wisely, feedback loops can help us make better decisions, resulting in overall improvements.  Feedback loops are useful in all sorts of situations, from relationships to what we eat for dinner.  Software can also be made to take advantage of feedback loops.

Feedback, Feedback loops, and Apple Pie

The difference between feedback and a feedback loop is an important one.  To illustrate, let’s say I get a new apple pie recipe I want to try out.  I follow the directions exactly, and the pie comes out burnt.  I see that it’s burnt (feedback), so I throw it out rather than eat it.  But I really like apple pie, so the next day I try again, with the same result.  Using feedback alone, I can throw the pie out again and again until the day I die.  What is needed is a feedback loop.  A feedback loop is where existing result(s) are used as input.  Where feedback prevented me from eating the burnt pie, a feedback loop can be used to make adjustments to the baking time and/or temperature.  After a few iterations of learning from the feedback, I’ll be able to enjoy a tasty apple pie, cooked to perfection.
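In code form, the pie example looks something like this. It is a deliberately silly sketch with made-up numbers, but it shows the result of each attempt feeding back into the next attempt's input.

```python
# Deliberately silly sketch of a feedback loop: the result of each attempt
# (burnt / underdone / perfect) adjusts the next attempt's bake time.
def bake(minutes):
    if minutes > 50:
        return "burnt"
    if minutes < 40:
        return "underdone"
    return "perfect"

minutes = 60                      # the recipe's (bad) suggestion
for attempt in range(1, 10):
    result = bake(minutes)
    print(f"attempt {attempt}: {minutes} min -> {result}")
    if result == "perfect":
        break
    elif result == "burnt":
        minutes -= 5              # adjust using the feedback
    else:
        minutes += 2
```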

Feedback in Software

Simple Feedback Loop

There are many forms of feedback in software, including exceptions, return values, method timings, and other measurements.  Whether it will be useful to create a feedback loop in your application is up to you – oftentimes there is little value in creating an automated feedback loop.  However, there are situations where feedback loops can be very helpful.  Software that is resource-bound or dependent on external systems is a good example of software that can benefit from feedback loops.  Here at Ancestry.com, much work has been done to create feedback loops based on service level agreements between dependent systems.  We’ve seen a lot of benefit, but we’ve also had a learning experience or two.  If you’ve ever heard screeching because of feedback with a microphone, you know that not all feedback loops produce improved results.  Yes, it’s possible to produce bad feedback loops in software.  This can happen with a single feedback loop, but it’s especially true if you have multiple feedback loops; multiple feedback loops with bidirectional interactions can quickly become a debugging nightmare.  The more you can compartmentalize the moving parts and limit interactions, the easier maintenance will be.

A real-world example

One of the projects I work on is a system that finds and fixes problems.  Most of the jobs this system runs are low priority.  The system uses a shared compute cluster to do most of the work, and we don’t want it to use the cluster if it is loaded up with higher priority work.  Rather than queue up jobs in the cluster for days at a time, a feedback loop is used to determine whether jobs should run on the cluster or on local processors.  The system monitors the cluster by sending a “ping” type task periodically and tracking that feedback (timings, exceptions, etc.) along with the feedback from real tasks.  With that feedback, the system determines when the cluster should be used, and it will dynamically change the compute destination accordingly.  Meanwhile, another feedback loop tracks how much work is queued up in memory and automatically adjusts how quickly work is produced.  We have other feedback loops around the data sources, to take action if we suspect this application is slowing down production systems.  The entire system works pretty well.
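A stripped-down sketch of the first of those loops might look like the following. The names, threshold, and single latency number are invented stand-ins; the real system tracks much more than this.

```python
import time

# Stripped-down sketch of a feedback loop that picks a compute destination.
# ping_cluster(), run_on_cluster() and run_locally() are hypothetical stand-ins
# for the real system's calls; the timeout value is invented.
PING_TIMEOUT_SECONDS = 5.0

def choose_destination(ping_cluster):
    try:
        started = time.monotonic()
        ping_cluster()                       # submit a tiny "ping" task
        latency = time.monotonic() - started
        return "cluster" if latency < PING_TIMEOUT_SECONDS else "local"
    except Exception:
        return "local"                       # cluster unhealthy or unreachable

def run_job(job, ping_cluster, run_on_cluster, run_locally):
    if choose_destination(ping_cluster) == "cluster":
        return run_on_cluster(job)
    return run_locally(job)
```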

My team has seen improved performance and stability as feedback loops have been judiciously added.  We’ve found them to be worthwhile on any system we’ve applied them to, critical or not.  While it seems there’s always more we can do, it’s nice to have systems do some self-tuning to keep things efficiently running.

Building an Operationally Successful Component – Part 3: Robustness
By Geoff Rayback, July 23, 2014

My previous two posts discussed building components that are “operationally successful.”  To me, a component cannot be considered successful unless it actually operates as expected when released into the wild.  Something that “works on my machine” cannot be considered a success unless it also works on the machine it will ultimately be running on.  Our team at Ancestry.com has found that we can ensure (or at least facilitate) operational success by following a handful of criteria.

In the final post in this series I want to discuss how to handle problems that the component cannot correct or overcome.  I am calling this robustness, although you could easily argue that overcoming and correcting problems is also a part of robustness.  The main distinction I want to make between what I called “self-correction” and what I am calling “robustness” is that there are problems that the system can overcome and still return a correct result, and there are problems that prevent the system from providing a correct result.  My last post discussed how to move as many problems as possible into the first category, and this post will discuss what we do about the problems left over in the second.

I propose that there are three things that should happen when a system encounters a fatal error:

  1. Degraded response – The system should provide a degraded response if possible and appropriate.
  2. Fail fast – The system should provide a response to the calling application as quickly as possible.
  3. Prevent cascading failures – The system should do everything it can to prevent the failure from cascading up or down and causing failures in other systems.

Degraded Response

A degraded response can be very helpful in creating failure-resistant software systems.  Frequently a component will encounter a problem, like a failed downstream dependency, that prevents it from returning a fully correct response.  But often in those cases the component may have much of the data it needed for a correct response.  It can often be extremely helpful to return that partial data to the calling application, because it allows that application to provide a degraded response in turn to its clients and on up the chain to the UI layer.  Human users typically prefer a degraded response to an error.  It is usually the software in the middle that isn’t smart enough to handle them.

For example, we have a service that returns a batch of security tokens to the calling application.  In many cases there may be a problem with a single token, but the rest were correctly generated.  In these cases we can provide the correct tokens to the calling application along with the error about the one(s) that failed.  To the end user, this results in the UI displaying a set of images, a few of which don’t load, which most people would agree is preferable to an error page.

The major argument against degraded responses is that they can be confusing for the client application.  A service that is unpredictable can be very difficult to work with.  Mysteriously returning data in some cases but not in others makes for a bad experience for developers consuming your service.  Because of this, when your service responds to client applications, it is important to clearly distinguish between a full response and a partial response.  I have become a big fan of the HTTP 206 status code, “Partial Content.”  When our clients see that code, they know that there was some kind of failure, and if they aren’t able to handle a partial response, they can treat the response as a complete failure.  But at least we gave them the option to treat the response as a partial success if they are able to.
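As an illustration of the batch-token example, a handler can return 200 when everything succeeds and 206 when only part of the batch could be generated. The service shape and field names here are hypothetical, not Ancestry's actual API.

```python
# Hypothetical batch-token endpoint illustrating a degraded (partial) response.
# Names and response shapes are invented; the point is the 200-vs-206 distinction.
def generate_token(image_id):
    if image_id.startswith("bad"):
        raise ValueError("cannot sign this image id")
    return f"token-for-{image_id}"

def get_tokens(image_ids):
    tokens, errors = {}, {}
    for image_id in image_ids:
        try:
            tokens[image_id] = generate_token(image_id)
        except ValueError as exc:
            errors[image_id] = str(exc)
    if not tokens:
        return 500, {"errors": errors}            # nothing succeeded: a real failure
    status = 200 if not errors else 206           # 206 "Partial Content": some succeeded
    return status, {"tokens": tokens, "errors": errors}

print(get_tokens(["img1", "bad2", "img3"]))
# (206, {'tokens': {'img1': 'token-for-img1', 'img3': 'token-for-img3'},
#        'errors': {'bad2': 'cannot sign this image id'}})
```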

In many ways I see the failure to use degraded responses as a cultural problem for development organizations.  It is important to cultivate a development culture where client applications and services all expect partial or degraded responses.  It should be clear to all developers that services are expected to return degraded responses if they are having problems, and client applications are expected to handle degraded responses, and the functional tests should reflect these expectations.  If everyone in the stack is afraid that their clients won’t be able to handle a degraded response, then everyone is forced to fail completely, even if they could have partially succeeded.  But if everyone in the stack expects and can handle partial responses, then it frees up everyone else in the stack to start returning them.  Chicken and egg, I know, but even if we can’t get everyone on board right away, we can all take steps to push the organizations we work with in the right direction.

Fail Fast

When a component encounters a fatal exception that doesn’t allow for even a partially successful response, then it has a responsibility to fail as quickly as possible.  It is inefficient to consume precious resources processing requests that will ultimately fail.  Your component shouldn’t be wasting CPU cycles, memory, and time on something that in the end isn’t going to provide value to anyone.  And if the call into your component is a blocking call, then you are forcing your clients to waste CPU cycles, memory, and time as well.  What this means is that it is important to try to detect failure as early in your request flow as possible.  This can be difficult if you haven’t designed for it from the beginning.  In legacy systems which weren’t built this way, it can result in some duplication of validation logic, but in my experience, the extra effort has always paid off once we got the system into production.

As soon as a request comes into the system, the code should do everything it can to determine if it is going to be able to successfully process the request.  At its most basic level, this means validating request data for correct formatting and usable values.  On a more sophisticated level, components can (and should) track the status of their downstream dependencies and change their behavior if they sense problems.  If a component senses that a dependency is unavailable, requests that require that dependency should fail without the component even calling it.  People often refer to this kind of thing as a circuit breaker.  A sophisticated circuit breaker will monitor the availability and response times of a dependency, and if the system stops responding or the response times get unreasonably long, the circuit breaker will drastically reduce the traffic it sends to the struggling system until it starts to respond normally again.  Frequently this type of breaker will let a trickle of requests through so it will be able to quickly sense when the dependency issue is corrected.  This is a great way to fail as fast as possible; in fact, if you build your circuit breakers correctly, you can fail almost instantly if there is no chance of successfully processing the request.
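A minimal circuit breaker can be surprisingly small. The sketch below uses invented thresholds and is far simpler than a production breaker, which would also track response times and let a trickle of test requests through, but it shows the fail-instantly behavior.

```python
import time

# Minimal circuit breaker sketch: after too many consecutive failures it
# "opens" and fails calls instantly, then allows another try after a cooldown.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast")  # fail instantly
            self.opened_at = None          # cooldown elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the breaker again
        return result
```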

Prevent Cascading Failures

Circuit breakers can also help implement my last suggestion, which is that a component should aggressively try to prevent failures from cascading outside of its boundaries.  In some ways this is the culmination of everything I have discussed in this post and my previous post.  If a system encounters a problem, and it fails more severely than was necessary (if it does not self-correct, and does not provide a degraded response), or the failure takes as long as, or longer than, a successful request (which is actually common if you are retrying or synchronously logging the exception), then it can propagate the failure up to its calling applications.  Similarly, if the system encounters a problem that results in increased traffic to a downstream dependency (think retries again: a dependency fails because it is overloaded, so the system calls it again, and again, compounding the issue), it has propagated the issue down to its dependencies.  Every component in the stack needs to take responsibility for containing failures.  The rule we try to follow on our team is that a failure should be faster and result in less downstream traffic than a success.  There are valid exceptions to that rule, but every system should start with that rule and only break it deliberately when the benefit outweighs the risk.  Circuit breakers can be tremendously valuable in these scenarios because they can make the requests fail nearly instantaneously and/or throttle the traffic down to an acceptable level as a cross-cutting concern that only has to be built once.  That is a more attractive option than building complex logic around each incoming request and each call to a downstream dependency (aspect-oriented programming, anyone?).

If development teams take to heart their personal responsibility to ensure that their code runs correctly in production, as opposed to throwing it over the wall to another team, or even to the customer, the result is going to be software that is healthier, more stable, more highly available, and more successful.

XX+UX happy hour at Ancestry.com with guest speaker Erin Malone
By Ashley Schofield, July 21, 2014

A career journey is a curvy path that usually takes unexpected turns, and as a designer in the growing field of UX, it’s sometimes a struggle to find the right environment to foster great design discussions with fellow UXers.

One of the things I’ve enjoyed most at Ancestry.com is the great team who have helped me grow tremendously as a designer.

I’m excited to announce that on July 31, Ancestry.com will be hosting a XX+UX happy hour to foster conversations around all things UX.

Guest speaker Erin Malone, a UXer with over 20 years of experience and co-author of Designing Social Interfaces, will be sharing stories of her journey into user experience and talking about the mentors who have helped her along the way. Find out more about the event here: https://xxux-ancestry.eventbrite.com

The Google+ XX+UX community is comprised of women in UX, design, research and technology. The community shares useful industry news and hosts some of the best design events I’ve ever attended.

These events were recently written about by the +Google Design page, “We’re proud to support this burgeoning, international community of women in design, research, and technology who can connect, share stories, and mentor each other online and offline.”

Their events don’t have the typical networking awkwardness and encourage comfortable conversation. I was surprised by how much I learned from just mingling with other colleagues in various work environments—that had never happened to me at prior “networking” events.

Connecting with others and swapping stories at events like this help to develop a greater understanding of my trade and grow a network of trusted colleagues I can rely on through the twists ahead in my career.

Hope to see you at the event and hear about your career journey.

Event Details:

XX+UX Happy Hour with speaker Erin Malone, hosted by Ancestry.com

July 31, 2014 from 6:00-9:00pm

Ancestry.com

153 Townsend St, Floor 8

San Francisco, CA 94107

Map: http://goo.gl/maps/RXHW2

Free pre-registration is required: https://xxux-ancestry.eventbrite.com

Ancestry.com Awarded Patent for Displaying Pedigree Charts on a Touch Device
By Gary Mangum, July 11, 2014

In 2011 Ancestry.com joined the mobile revolution and I was given the opportunity to work on a new app that would bring our rich genealogical content to iOS and Android devices.  The original app was called ‘Tree To Go’, but a really funny thing about this name was that the app did not have a visual ‘tree’ anywhere in the user interface; it provided only a list of all of the people in a user’s family ‘tree’.  We joked that it would have been more appropriately named ‘List To Go’ instead.  We knew that providing a tree experience for visualizing their family data would be an important feature to quickly bring to our customers.  Our small team went to work brainstorming ideas and quickly came up with some rather unique ways to visualize familial relationships.  We were challenged to ‘think outside the box’ by our team lead who asked us to envision the best way to put this data in front of our users taking advantage of the uniqueness of mobile devices with touch screens, accelerometers, limited real estate, and clumsy fingers instead of a mouse.  We needed our design to be very intuitive.  We wanted users to quickly pick up the device and start browsing the tree without reading any instructions.  This was a fun challenge and some of the ideas that we came up with ended up getting described in various patent idea disclosure documents where we had to explain why our solutions presented unique ways of solving the problem.

One night, while pondering on this problem, the idea came to me that a user who is visualizing only a small part of his family tree on the mobile device would be inclined to want to swipe his finger on the device in order to navigate further back into his tree.  If we could continually prepare and buffer ancestral data off screen then we could give the user the impression that he could swipe further and further back in his tree forever until he reached his chosen destination.  And so the idea was born.
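In data-structure terms, the trick is to keep a generation or two of ancestors fetched and laid out off screen before the swipe that needs them. Here is a rough Python sketch of the buffering idea; the function names and the buffer depth are hypothetical, not the app's actual code.

```python
# Rough sketch of off-screen buffering for a continuously swipeable pedigree.
# fetch_parents() stands in for whatever call loads a person's parents; the
# choice of 2 buffered generations beyond the visible tree is invented.
VISIBLE_GENERATIONS = 4
BUFFER_GENERATIONS = 2

def generations_from(root, fetch_parents, depth):
    """Return a list of generations: [[root], [parents], [grandparents], ...]."""
    generations, current = [[root]], [root]
    for _ in range(depth):
        current = [p for person in current for p in fetch_parents(person)]
        if not current:
            break
        generations.append(current)
    return generations

def on_swipe(new_root, fetch_parents):
    """When the user swipes to re-root the view, keep extra generations buffered."""
    buffered = generations_from(new_root, fetch_parents,
                                VISIBLE_GENERATIONS + BUFFER_GENERATIONS)
    visible = buffered[:VISIBLE_GENERATIONS]
    return visible, buffered
```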


We iterated on the idea as a team trying to figure out:

  • what is the correct swiping action to look for?
  • how many generations of people should be displayed on the device and how should they be laid out on the screen?
  • what were the most optimal algorithms for buffering and preparing the off screen data?
  • how would we make the swiping gesture and the animations feel natural and intuitive to the user?
  • should the user be able to navigate in both directions (back through ancestors as well as forward through descendants)? and if so, what would that look like to the user?
  • could this idea handle both tree navigation as well as pinch and zoom?
  • would this idea lend itself to different tree views concepts?
  • what would it mean if the user tapped on part of the tree?

After lots of work and some great user feedback the idea finally became a reality.  The new ‘continuously swiping’ tree view became a prominent feature of the 2.0 version of the newly renamed Ancestry iOS mobile app and has given us a great platform to build on.  I’m pleased to announce that Ancestry.com was recently awarded a patent for this pedigree idea on July 1, 2014 (http://www.google.com/patents/US8769438).

If you’d like to experience the app for yourself, you can download it here.


The Importance of Context in Resolving Ambiguous Place Data
By Laryn Brown, July 10, 2014

When interpreting historical documents with the intent of researching your ancestors, you are often presented with less-than-perfect data. Many of the records that are the backbone of family history research are bureaucratic scraps of paper filled out decades ago in some government building. We should hardly be surprised when the data entered is vague, confusing, or just plain sloppy.

Take, for example, a census form from the 1940s. One of the columns of information is the place of birth of each individual in the household. Given no other context, these entries can be extremely vague and, in some cases, completely meaningless to the modern generation.

Here are some examples:

  • Prussia
  • Bohemia
  • Indian Territory

Additionally, there are entries that on the face of them seem clear, but with more context have new complexity:

  • Boston (England)
  • Paris (Idaho)
  • Provo (Bosnia)

And finally, we have entries that are terrifically vague and cannot be resolved without more context:

  • Springfield
  • Washington
  • Lincoln

If we add the complexity of automatic place parsing, where we try to infer meaning from the data and normalize it to a common form that we can search on, the challenges grow.

In the above example, if I feed “Springfield” into our place authority, which is a tool that normalizes different forms of place names to a single ID, I get 63 possible matches in a half dozen countries. This is not that helpful. I can’t put 63 different pins on a map, or try and match 63 different permutations to create a good DNA or record hint.

I need more context to narrow down the field to the one Springfield that represents the intent of that census clerk a hundred years ago.

One rather blunt approach is to sort the list by population. Statistically, more people will be from a larger Springfield than from a smaller one. But this has all sorts of flaws, such as excluding rural places from ever being legitimate matches. If you happen to be from Paris, Idaho, we are never going to find your record.

Another approach would be to implement a bunch of logical rules, where for the case of a name that matches a U.S. state we would say things like “Choose the largest jurisdiction for things that are both states and cities.” So “Tennessee” must mean the state of Tennessee, not the five cities in the U.S. that share the same name. Even if you like those results, there are always going to be exceptions that break the rule and require a second rule – such as the state of Georgia and the country of Georgia. The new rule would have to say “Choose the largest jurisdiction for things that are both states and cities, but don’t choose a Georgia as a country because it is really a state.”

It is clear that a rules-based approach will not work. But since we still need to resolve ambiguity, how is it to be done?

I propose a blended strategy that takes three approaches (a rough sketch in code follows the list).

  1. Get context from wherever you can to limit the number of possibilities. If the birth location for Grandpa is Springfield and the record set you are studying is the Record of Births from Illinois, then the additional context may give you enough data to make a conclusion that Springfield=Springfield, Illinois, USA. What seems obvious to a human observer is actually pretty hard with automated systems. These systems need to learn where to find this additional context and Natural Language parsers or other systems need to be fed more context from the source to facilitate a good parse.
  2. Preserve all unresolved ambiguity. If the string I am parsing is “Provo” and my authority has a Provo in Utah, South Dakota, Kentucky, and Bosnia, I should save all of these as potential normalized representations of “Provo.” It is a smaller set to match on when doing comparisons and you may get help later on to pick the correct city.
  3. Get a human to help you. We are all familiar with applications and websites that give us that friendly “Did you mean…” dialogue. This approach lets a user, who may have more context, choose the “Provo” that they believe is right. We can get into a lot of trouble by trying to guess what is best for the customer instead of presenting a choice to them. Maybe Paris, Idaho is the Paris they want, maybe not. But let them choose for you.
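Here is that sketch: a toy place authority that keeps every candidate for an ambiguous name, narrows the list with whatever context is available, and only resolves to a single place when one candidate remains, otherwise leaving the choice to a human. The tiny authority data and the API shape are invented for illustration.

```python
# Toy sketch of the blended strategy: keep all candidates, narrow by context,
# and fall back to asking the user. The authority data and API are invented.
PLACE_AUTHORITY = {
    "springfield": [
        {"id": 1, "name": "Springfield, Illinois, USA", "country": "USA", "region": "Illinois"},
        {"id": 2, "name": "Springfield, Missouri, USA", "country": "USA", "region": "Missouri"},
        {"id": 3, "name": "Springfield, England", "country": "England", "region": "Essex"},
    ],
}

def resolve(place_text, context=None):
    """Return (resolved_place_or_None, remaining_candidates)."""
    candidates = PLACE_AUTHORITY.get(place_text.strip().lower(), [])
    if context:  # 1. use any context (e.g. the record set's region) to narrow the list
        narrowed = [c for c in candidates
                    if context.get("region") in (None, c["region"])
                    and context.get("country") in (None, c["country"])]
        if narrowed:
            candidates = narrowed
    if len(candidates) == 1:
        return candidates[0], candidates
    return None, candidates  # 2. preserve the ambiguity; 3. let a human pick ("Did you mean...")

print(resolve("Springfield", {"country": "USA", "region": "Illinois"})[0]["name"])
# Springfield, Illinois, USA
```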

In summary, context is the key to resolving ambiguity when parsing data, especially ambiguous place names. Using a blended approach that makes use of all available context, preserves any remaining ambiguity, and presents those ambiguous results to the user for resolution seems like the most successful strategy to solving the problem.

Lessons Learned Building a Messaging Framework
By Xuyen On, July 1, 2014

We have built out an initial logging framework with Kafka 0.7.2, a messaging system developed at LinkedIn. This blog post will go over some of the lessons we’ve learned by building out the framework here at Ancestry.com.

Most of our application servers are Windows-based, and we want to capture IIS logs from these servers. However, Kafka does not include any producers that run on the Microsoft .Net platform. Thankfully, we were able to find an open source project where someone else wrote libraries that run on .Net that could communicate with Kafka. This allowed us to develop our own custom producers to run on our Windows application servers. You may find that you will also need to develop your own custom producers because every platform is different. You might have applications running on different OS’s, or your applications might be running in different languages. The Apache Kafka site lists all the different platforms and programming languages that it supports. We plan on transitioning to Kafka 0.8, but we could not find any corresponding library packages like there were for 0.7.

Something to keep in mind when you design your producer is that it should be as lean and efficient as possible. The goal is to have as high a throughput for sending messages to Kafka as possible while keeping the CPU and memory overhead as low as possible, so as not to overload the application server. One design decision we made early on was to have compression in our producers in order to make communication between the producers and Kafka more efficient and faster. We initially used gzip because it was natively supported within Kafka. We achieved very good compression results (10:1) and also had the added benefit of saving storage space. We have two kinds of producers. One runs as a separate service which simply reads log files in a specified directory where all the log files to be sent are stored. This design is well suited for cases when the log data is not time critical, because the data is buffered in log files on the application server. This is useful because if a Kafka cluster becomes unavailable, the data is still saved locally. It’s a good safety measure against network failures and outages. The other kind of producer we have is hard coded into our applications. The messages are sent directly to Kafka from code. This is good for situations where you want to get the data to Kafka as fast as possible, and it could be interfaced with a component like Samza (another project from LinkedIn) for real-time analysis. However, messages can be lost if the Kafka cluster becomes unavailable, so a failover cluster would be needed to prevent message loss.
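For readers on a platform with an official client, the file-reading style of producer looks roughly like the sketch below. It uses the kafka-python library against a modern broker, not the .Net library or the Kafka 0.7 wire protocol described above, and the directory and topic names are invented, so take it as an illustration of the pattern rather than our actual producer.

```python
# Illustration of the file-reading producer pattern using kafka-python
# (a different client and Kafka version than the post describes).
from pathlib import Path
from kafka import KafkaProducer

LOG_DIR = Path(r"C:\logs\iis")          # hypothetical directory of IIS logs
TOPIC = "iis_logs"                      # hypothetical topic name

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],
    compression_type="gzip",            # compress batches on the producer side
)

for log_file in sorted(LOG_DIR.glob("*.log")):
    with log_file.open("rb") as handle:
        for line in handle:
            producer.send(TOPIC, value=line.rstrip(b"\r\n"))
    # A real service would remember its position so files are not re-sent.

producer.flush()   # block until all buffered messages are delivered
```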

To get data out of Kafka and into our Hadoop cluster we wrote a custom Kafka consumer job that is a Hadoop map application. It is a continuous job that runs every 10-15 minutes. We partitioned our Kafka topics to have 10 partitions per broker. We have 5 Kafka brokers in our cluster that are treated equally, which means that a message can be routed to any broker as determined by a load balancer. This architecture allows us to scale out horizontally, and if we need to add more capacity to our Kafka cluster, we can just add more broker nodes. Conversely, we can take out nodes as needed for maintenance. Having many partitions allows us to scale out more easily because we can increase the number of mappers in our job that read from Kafka. However, we have found that splitting up the job into too many pieces may result in too many files being generated. In some cases, we were generating a bunch of small files that were less than the Hadoop block size, which was set to 128 MB. This problem became evident when we ingested a large batch of over 40 million small files into our Hadoop cluster. This caused our NameNode to go down because it was not able to handle the sheer number of file handles within the directory. We had to increase the Java heap memory size to 16 GB just to be able to do an ls (listing contents) on the directory. Hadoop likes to work with a small number of very large files (they should be much larger than the block size), so you may find that you will need to tweak the number of partitions used for the Kafka topics, as well as how long you want your mapper job to write to those files. Longer map times with fewer partitions will result in fewer and larger files, but it will also mean that it takes longer for messages to become queryable in Hadoop, and it can limit the scalability of your consumer job since you will have fewer possible mappers to assign to the job.

Another design decision we made was to partition the data within our consumer job. Each mapper would create a new file each time a new partition value was detected. The topic and partition values would be recorded in the filename. We created a separate process that would look in a staging directory in HDFS where the files were generated. This process would look at the file names and determine whether the corresponding tables and partitions already exist in Hive. If they do, it would simply move those files into the corresponding directory under the Hive external table directory in HDFS. If a partition did not already exist, it would dynamically create a new one. We also compressed the data within the consumer job to save disk space. We initially tried gzip, which gave good compression rates, but it dramatically slowed down our Hive queries due to the processing overhead. We are now trying bzip2, which gives less compression, but our Hive queries are running faster. We chose bzip2 because of its lower processing overhead, but also because it is a splittable format. This means that Hadoop can split a large bz2 file and assign multiple mappers to work on it.
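The staging-to-Hive routing can be sketched as follows. The filename convention, warehouse path, and table naming below are invented stand-ins for ours, and a real implementation would use the HDFS and Hive APIs rather than plain string handling.

```python
# Sketch of routing a staged file into a Hive external table's partition
# directory based on values encoded in the filename. The naming convention
# (topic.dt=YYYY-MM-DD.part.bz2) and paths are invented for illustration.
import re

WAREHOUSE_ROOT = "/hive/warehouse/logs"
FILENAME_RE = re.compile(r"(?P<topic>[\w-]+)\.dt=(?P<dt>\d{4}-\d{2}-\d{2})\.\d+\.bz2$")

def partition_path(staged_filename):
    match = FILENAME_RE.match(staged_filename)
    if not match:
        raise ValueError(f"unexpected staging filename: {staged_filename}")
    return f"{WAREHOUSE_ROOT}/{match.group('topic')}/dt={match.group('dt')}"

def hive_add_partition_ddl(staged_filename):
    """DDL a loader could run if the partition does not exist yet."""
    match = FILENAME_RE.match(staged_filename)
    topic, dt = match.group("topic"), match.group("dt")
    return (f"ALTER TABLE {topic} ADD IF NOT EXISTS PARTITION (dt='{dt}') "
            f"LOCATION '{partition_path(staged_filename)}'")

print(partition_path("iis_logs.dt=2014-06-30.00017.bz2"))
# /hive/warehouse/logs/iis_logs/dt=2014-06-30
```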

That covers a few of the lessons learned thus far as we build out our messaging framework here at Ancestry. I hope you will be able to use some of the information covered here so that you can avoid the pitfalls we encountered.


Controlling Costs in a Cloudy Environment
By Daniel Sands, June 24, 2014

From an engineering and development standpoint, one of the most important aspects of cloud infrastructure is the concept of unlimited resources. The idea of being able to get a new server to experiment with, or being able to spin up more servers on the fly to handle a traffic spike is a foundational benefit of cloud architectures. This is handled in a variety of different ways with various cloud providers, but there is one thing that they all share in common:

Capacity costs money. The more capacity you use, the more it costs.

So how do we provide unlimited resources to our development and operations groups without it costing us an arm and a leg? The answer is remarkably simple. Visibility is the key to controlling costs on cloud platforms. Team leads and managers with visibility into how much their cloud based resources are costing them can make intelligent decisions with regard to their own budgets. Without decent visibility into the costs involved in a project, overruns are inevitable.

This kind of cost tracking and analysis has been the bane of accounting groups for years, but several projects have cropped up to tackle the problem. Projects like Netflix ICE provide open source tools to track costs in public cloud environments. Private cloud architectures are starting to catch up to public clouds with projects like Ceilometer in OpenStack, but it can be trickier to determine accurate costs in private environments due to the variables involved in a custom internal architecture.

The most important thing in managing costs of any nature is to realistically know what the costs are. Without this vital information, effectively managing the costs associated with infrastructure overhead can be nearly impossible.
