One of the insights that I brought to my current job from the localization industry many years ago is the idea that you should create once and export or publish many times. In that industry we had the concept of an asset – say an instruction manual on how to use your microwave – that was created once in a base language then repurposed in dozens of other languages and markets as needed. If you want to expand distribution to 20 countries you definitely want each subsequent book to be a lot less costly than the first.
Additionally, when creating the first version of an asset, you want to be able to take inputs from all sorts of different sources: text files, images, spreadsheets, and so on.
When processing large volumes of historical records for Ancestry.com, the same principle applies. I call it “Hourglass-shaped Data Processing.”
Today, the team I work with processes historical birth, marriage, or death records, transforming the data from an unstructured, narrative state to a searchable and fielded state. Think of an obituary in the newspaper being transformed into fields and records in a database.
To do this successfully, we have to take data from all kinds of sources – OCR output, manually keyed data, customer corrections, third-party websites, cached HTML pages, GEDCOM files, internal database tables, XML exports, and more. We then transform that data into a common form for structuring and enhancement. Once the data has been processed, we export it to the various internal departments in the form they need – sometimes in the same form we imported it, other times transformed into something entirely different.
This input and output flexibility, with a common set of processing tools in the center for centralized and consistent data structuring, saves time and resources. The alternative might be easier when you have only one or two data formats to deal with, but it quickly breaks down once you try to scale either your inputs or your outputs.
For example, let’s say that an internal department would like to use our data services engine to normalize a bunch of dates in a field. The dates have been entered haphazardly without any validation or normalization. This department happens to have the data in a CSV file format with no column headings to map the data to, just the position of the columns. It would be pretty easy to create a custom pipeline that reads in a .csv file and emits a .csv file out the back end. But what do you do when request number two comes in and the input format is now a table in SQL and the output file needs to be in XML? If your core data processing system is coupled to your input type or output type, soon it will be impossible to keep all of your branched code up to date with your latest algorithms.
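To make the date example concrete, here is a minimal sketch of what such a normalizer might look like. The pattern list and function names are my own illustration, not Ancestry's actual engine: it simply tries a handful of common date layouts and emits a canonical ISO form when one matches.

```python
from datetime import datetime

# Illustrative set of layouts seen in haphazardly entered date fields.
PATTERNS = ["%m/%d/%Y", "%d %b %Y", "%B %d, %Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    """Return an ISO 8601 date when a known pattern matches,
    otherwise pass the original value through for later review."""
    value = raw.strip()
    for pattern in PATTERNS:
        try:
            return datetime.strptime(value, pattern).date().isoformat()
        except ValueError:
            continue
    return raw
```

The key design point is that this function knows nothing about CSV, SQL, or XML – it operates on a single field value, so the same code serves every input and output format.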
Instead, we took the time to write input parsers that take whatever we receive and move it into a common format. The core of our data processing system then only has to deal with that one format, and it exports results in the same format. A separate exporter then transforms the data into whatever target format is needed.
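The hourglass shape can be sketched in a few lines. This is an assumed design, not Ancestry's code: importers and exporters are registered by format name, and the core pipeline only ever sees the common fielded-record shape at the narrow waist.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # the common format: one fielded record

IMPORTERS: Dict[str, Callable[[str], List[Record]]] = {}
EXPORTERS: Dict[str, Callable[[List[Record]], str]] = {}

def importer(fmt: str):
    """Register a function that parses an external format into records."""
    def register(fn):
        IMPORTERS[fmt] = fn
        return fn
    return register

def exporter(fmt: str):
    """Register a function that serializes records to an external format."""
    def register(fn):
        EXPORTERS[fmt] = fn
        return fn
    return register

@importer("csv")
def read_positional_csv(text: str) -> List[Record]:
    # Columns identified by position only, as in the example above.
    return [{"col%d" % i: v for i, v in enumerate(line.split(","))}
            for line in text.splitlines() if line]

@exporter("csv")
def write_positional_csv(records: List[Record]) -> str:
    return "\n".join(",".join(r[k] for k in sorted(r)) for r in records)

def process(text: str, in_fmt: str, out_fmt: str,
            core: Callable[[List[Record]], List[Record]]) -> str:
    # The narrow waist: the core never touches an external format.
    return EXPORTERS[out_fmt](core(IMPORTERS[in_fmt](text)))
```

When request number two arrives with SQL in and XML out, you register one new importer and one new exporter; `process` and the core algorithms are untouched.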
The initial cost in architecture and setup is quite a bit heavier than just writing for one input and one output, but this quickly pays off when you start adding new data types and formats on either end of the hourglass.
In a real-world example, an internal department needed to run billions of records through our field normalization engine. They handed us the files in JSON format and wanted the results exported to an API fronting a database. So we wrote a custom importer and a custom exporter and finished in a week or so. But then they decided they didn't like the performance of the output format, and they changed the input format as well. It took only a few days to switch over to a custom XML input and a flat-file output.
This flexibility has allowed us to take inputs in dozens of flavors and formats and export in nearly that many ways, all without having to make code changes to the core service we provide. I would recommend this approach to anyone processing large amounts of data.