We are excited to announce that the Ancestry.com handwriting recognition competition proposal was accepted as one of seven official International Conference on the Frontiers of Handwriting Recognition (ICFHR-2014) competitions. As part of our competition on word recognition from segmented historical documents, we are announcing the availability of a new image database,¹ ANWRESH-1, which contains segmented and labeled documents for use by researchers in the document analysis community.
We invite you to visit our competition website to learn what the competition entails, see the prizes offered, and register if you are interested. A few key dates to note:
- Competition Registration Deadline: March 24, 2014
- Submission Deadline: April 1, 2014
- Benchmark Database Availability: April 2, 2014
- Results Announced: September 4, 2014
Read on to learn about the ICFHR conference, the Ancestry.com competition and database, and why we are so excited to be sponsoring this competition.
Since 1990, the document analysis research community has been meeting every two years for a series of conferences called ICFHR, the International Conference on the Frontiers of Handwriting Recognition.
Quoting from the ICFHR home page:
ICFHR is the premier international forum for researchers and practitioners in the document analysis community for identifying, encouraging and exchanging ideas on the state-of-the-art technology in document analysis, understanding, retrieval, and performance evaluation. The term document in the context of ICFHR encompasses a broad range of documents from historical forms such as palm leaves and papyrus to traditional documents and modern multimedia documents. … The ICFHR provides a forum for researchers in the areas of on-line and off-line handwriting recognition, pen-based interface systems, form processing, handwritten-based digital libraries, and web document access and retrieval.
The format of the conference is fairly typical: a variety of pre-conference tutorials, followed by the conference proper, which consists of multiple parallel tracks of oral and poster presentations. A fairly modern innovation for these kinds of conferences is the inclusion of sponsored competitions that take place in the months leading up to the conference, with the results announced and discussed (and in some cases, debated) in sessions on the last day of the conference.
The ANWRESH-1 Database
An important part of our competition is the new database, ANWRESH-1, that we are making available to the document analysis research community. We expect many in the research community will find it interesting and helpful in their work. It consists of about 800,000 “image snippets” of handwritten text drawn from about 4,000 images from the 1920 and 1930 U.S. Censuses. Specifically, on each image we have located (segmented) the Name, Relation, Age, Marital Condition, and Place of Birth fields and labeled them with their ground truth values. An example image is shown below in Figure 1. The figure shows one row (called a record), with each of the fields we are using in this competition labeled with its field type and highlighted in yellow.
The challenge in this competition is to use the ANWRESH-1 database to create field-specific recognizers that take segmented image snippets of handwritten text and automatically transcribe them (or assist with their transcription) into the corresponding textual representations for these fields.
One possible approach for the Place of Birth field, which takes advantage of the repetition of values common in this kind of collection, might be to develop a mathematical model that clusters the ink strokes in a snippet using some distance metric, such that similar words (under this metric) belong to the same cluster. The following snippets would be “close together” under this metric and thus would be in the same (green) cluster.
This clustering algorithm wouldn’t have the slightest idea what characters are formed from the ink strokes, but it would know that the following snippets are different from the snippets in the green cluster (and thus belong together in the blue cluster):
This approach is very powerful when you encounter a document containing birthplace entries like the following:
Once a human keyer identifies the very first occurrence as the text “alabama”,² the clustering algorithm can automatically label the remaining “alabama” fields as similar or identical, and the human keyer can then quickly and easily review those labels. In some cases, the repetition of field values could allow this kind of algorithm to reduce the number of fields that must be keyed by one or two orders of magnitude.
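To make the idea concrete, here is a minimal sketch of the cluster-then-key workflow in Python. It is an illustration only, not the method used in the competition or by any participant. It assumes each snippet has already been reduced to a small numeric feature vector (the two-dimensional values below are made up), uses plain Euclidean distance as a stand-in for "some distance metric", and uses a simple greedy "leader" clustering with a hand-picked threshold. The names `cluster_snippets`, `propagate_labels`, and `key_snippet`, and the second birthplace value "georgia", are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (a stand-in metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_snippets(features, threshold):
    """Greedy 'leader' clustering.

    Each snippet joins the first cluster whose leader (its first member) is
    within `threshold`; otherwise it starts a new cluster and becomes its leader.
    Returns (leaders, assignment): leader snippet index per cluster, and the
    cluster id assigned to every snippet.
    """
    leaders = []
    assignment = []
    for i, feat in enumerate(features):
        for cid, leader in enumerate(leaders):
            if distance(feat, features[leader]) <= threshold:
                assignment.append(cid)
                break
        else:
            # No existing cluster is close enough: start a new one.
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


def propagate_labels(leaders, assignment, key_snippet):
    """Have the keyer label only each cluster's leader, then copy that label
    to every member of the cluster (for later human review)."""
    cluster_label = {cid: key_snippet(leader) for cid, leader in enumerate(leaders)}
    return [cluster_label[cid] for cid in assignment]


# Made-up feature vectors for six birthplace snippets (assumption: real
# features would come from the ink strokes, not from hand-typed numbers).
features = [(0.10, 0.90), (0.12, 0.88), (0.80, 0.20),
            (0.11, 0.91), (0.82, 0.18), (0.09, 0.93)]
# What a human would key for each snippet if asked (hypothetical values).
truth = ["alabama", "alabama", "georgia", "alabama", "georgia", "alabama"]

keyed = []  # indices of snippets the human actually had to key


def key_snippet(index):
    """Simulated human keyer: records the request and returns the transcription."""
    keyed.append(index)
    return truth[index]


leaders, assignment = cluster_snippets(features, threshold=0.1)
labels = propagate_labels(leaders, assignment, key_snippet)

print("cluster ids:", assignment)
print("labels:", labels)
print(f"keyed {len(keyed)} of {len(features)} snippets")
```

In this toy run only the first snippet of each cluster is keyed and the rest inherit its label. The savings grow with the amount of repetition in the collection; a real system would also need a far more robust feature representation and metric than the ones assumed here, plus the human review step described above to catch snippets placed in the wrong cluster.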
Is Competition a Good Thing?
One might ask what we hope to gain by sponsoring this competition. Helping the document analysis community advance handwriting recognition technology is a strategic initiative for Ancestry.com. As we have discussed in previous blog posts, the process of converting images of historical documents containing handwritten names, dates, relationships and places into a textual representation suitable for searching is almost all done manually. This transcription process is expensive and time-consuming, and is thus a limiting factor in large-scale efforts to extract the data contained in the vast libraries of archived historical documents. Considering the billions of valuable historical documents currently residing on microfilm, microfiche and paper, it’s clear that advancing handwriting recognition systems to the point where they can automate (or even partially automate) the transcription process could be hugely beneficial.
In sponsoring ANWRESH-2014, we are reaching out to researchers developing technologies in word recognition, word spotting, word clustering, machine learning and other related fields to encourage their participation and collaboration. Initially, we want our efforts in this area to generate interest and awareness, foster connections, and enable collaboration. We hope this competition and the ANWRESH-1 database will enable fresh, unconventional approaches to this difficult, multi-faceted problem. At the conclusion of the competition, at a minimum, we hope to have a much better understanding of the current state of the art in handwritten word recognition on historical documents. As we proceed beyond this competition, we anticipate that a spectrum of innovative techniques will emerge. As a growing and diverse community uses increasingly larger, cleaner, and most importantly, shared databases of historical documents to help characterize these techniques, we expect to see real, albeit incremental, progress in these technologies. That progress will enable us to unlock valuable document collections that are simply out of reach with today’s technologies and make them available to family historians.
1. The name of our competition and database, ANWRESH, stands for ANcestry.com Word REcognition from Segmented Historical Documents.
2. The lower-case “a” in “alabama” is because of our “key-as-seen” policy: If the text looks like a lower-case letter, that’s the way it is keyed.