You know how, when you are filling out a form on a website, it will sometimes ask you to type the word you see in a box to prove you’re human? Or to pick which images in a grid show fruit? Kind of annoying, right? Well, it turns out that when the form is a reCAPTCHA, you’re helping to digitize old printed words and images. According to Wikipedia, that’s how the archive of the New York Times got digitized.
Old scientists’ journals are loaded with character, scientific information and historical relevance, all stored up in unsearchable handwritten scrawl. To draw deep insights from these documents, researchers need them digitized. The American Museum of Natural History has piles of old journals scanned, but there hasn’t been a quick way of turning them into transcribed text, synced with the original documents. At its recent ‘Hack The Dinos’ challenge in November, museum staff put the scanned journals in front of developers and designers to see if they could come up with a system for crowdsourcing transcriptions.
“The differences between the journal writers was intriguing,” Evie Borthwick, a member of a four-person team that took on this challenge, told us in an email. “One author had done such impressive drawings of strata where items were found, with colored ink no less.”
The Observer followed up with a team that took its inspiration from reCAPTCHA to build a word-by-word system to turn these pages into data. Here’s their demo at the end of the 24-hour, night-in-the-museum hackathon, which went up online late last month:
“A transcribe system for expeditions, among all the tasks that could be done as a part of the hackathon, particularly appealed to me because it was the most universal in nature,” Smriti Jha, another team member, told the Observer in an email. “This seemed like a project that would interest and assist programmers and professionals from fields outside of Paleontology as well.”
Their hackathon project is not the first effort to recruit hordes of humans online to help with transcription. We recently wrote about an old science fiction novel that got online thanks to volunteer work. On the academic side, DIY History at the University of Iowa is taking on a very similar problem, but with a different strategy: It goes page by page. Ms. Borthwick explained that this approach can be problematic for this set of journals, because sometimes the text ends up sprawling across two adjacent pages (such as when it accompanies a large illustration).
“We took the idea further and decided to completely automate the process of extracting words out of images, and crowdsourcing to transcribers one word at a time instead of one page at a time. It was important to us that we engage users and not overwhelm them with a daunting task,” Ms. Jha explained. This single-word approach would have the added benefit of making it easy to put the system on mobile, should the museum ever decide to do so.
One of the biggest problems this data set presented, though, was its unusual words. Take this page from Walter Granger’s 1898 notebook: The first entry looks like it says “sauropod,” but the second? No clue, and it is tough to guess when it’s clearly a strange science word. Beneath it, what appears to be “audal vertebra” proved to be “caudal vertebra,” according to Google. The current system asks users to either enter a guess into a box or vote a recent guess up or down. “We decided to keep all edited ‘versions’ of the transcribed word in memory, because assuming our users have the best of intentions, if a recently-edited version gets enough negative votes then we want to [be] able to revert back to a previously-edited version,” Ms. Jha wrote.
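For readers who want to picture how that guess-and-vote scheme might hang together, here is a minimal sketch in Python. It is not the team’s code: the class names, the `revert_at` threshold of -3 and the rule of only ever voting on the newest live guess are all assumptions made for illustration. What it does borrow from Ms. Jha’s description is the idea that every version stays on file, so a heavily downvoted guess can be set aside in favor of the one before it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    text: str
    score: int = 0          # upvotes minus downvotes
    rejected: bool = False  # set once the crowd votes this guess down


@dataclass
class WordRecord:
    # Hypothetical threshold: net score at which the newest guess is set aside.
    revert_at: int = -3
    # Every guess ever submitted is kept, never deleted.
    versions: List[Version] = field(default_factory=list)

    def _live(self) -> List[Version]:
        return [v for v in self.versions if not v.rejected]

    def current(self) -> Optional[str]:
        """Return the newest guess that has not been voted down."""
        live = self._live()
        return live[-1].text if live else None

    def submit(self, text: str) -> None:
        """Record a new guess as the latest version."""
        self.versions.append(Version(text))

    def vote(self, up: bool) -> None:
        """Vote the newest live guess up or down, reverting if it sinks."""
        live = self._live()
        if not live:
            return
        latest = live[-1]
        latest.score += 1 if up else -1
        # Only revert when an earlier guess exists to fall back on.
        if latest.score <= self.revert_at and len(live) > 1:
            latest.rejected = True


record = WordRecord()
record.submit("caudal vertebra")
record.submit("audal vertebra")
for _ in range(3):
    record.vote(up=False)
print(record.current())
```

In this sketch the rejected guess is flagged rather than thrown away, so the full edit history is still there if anyone wants to audit it later.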
With more time, the team would have liked to build out the system to auto-suggest words from the scientific lexicon.
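The team did not get to build that feature, so what follows is only a rough idea of how such a suggestion step could work, not a description of their plans. It uses `difflib.get_close_matches` from Python’s standard library to compare a transcriber’s guess against a word list. The six-word `LEXICON` and the `suggest` helper are made up for this example; a real system would need a far larger vocabulary of scientific terms.

```python
import difflib

# Hypothetical, tiny stand-in for a real scientific word list.
LEXICON = ["caudal", "cervical", "dorsal", "femur", "sauropod", "vertebra"]


def suggest(guess, lexicon=LEXICON, limit=3):
    """Return up to `limit` lexicon words that closely resemble the guess."""
    return difflib.get_close_matches(guess.lower(), lexicon, n=limit, cutoff=0.6)


# A transcriber who misses the faint first letter still gets pointed to "caudal".
print(suggest("audal"))
```

The `cutoff` value controls how similar a word has to be before it is offered; set it too low and users get noise, too high and near-misses like the one above stop showing up.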
One of the largest technical challenges the team contended with was the quality of the scans themselves. “A rescan at a higher resolution would improve the odds of getting useful results,” Ms. Borthwick wrote.
Anyone who wants to carry the team’s work further can snag what they have done so far on GitHub.