
Creating a model to transcribe 250,000 specimen labels at the Museum für Naturkunde Berlin

The Museum für Naturkunde Berlin - Leibniz-Institute for Evolution and Biodiversity Science (MfN) is one of the largest institutions of its kind in Germany, with an expansive collection of natural history specimens from around the world. These specimens are a vital resource for researchers and educators, many of whom would like to be able to access them digitally. But the sheer scale of the collection has made digital accessibility both a logistical and a technological challenge.
However, READ-COOP, the cooperative behind Transkribus, is always up for a challenge. So when we first heard that the MfN was looking for a service provider to transcribe the handwritten and printed labels accompanying its entomological collection, we were keen to work with them. With the project now complete, we spoke to Franziska Schuster from the MfN's digitisation team about the process, and about how the project has enhanced the work of both the MfN and Transkribus.
The Museum für Naturkunde Berlin is a leading research centre for natural history. ©MfN Berlin
A challenging collection
Entomology is the study of insects, and the entomological holdings of the MfN include approximately 15 million insect specimens. Most of these are accompanied by small handwritten or printed labels containing key details about the specimen. As Franziska explained: "[These labels contain] information about the place and date of gathering, the collector and often other scientific data, for example, about the expedition during which the animal was collected or how it was collected."
Currently, most of this data is only available in paper form, making it difficult for researchers to search for the details of specific specimens. The MfN had already attempted to improve access by digitising some of the labels and transcribing them manually. But due to the size of the collection, it was hard to carry this out in a comprehensive way. “When specimens from the collection are recorded in a database, we often only record a minimum data set and not all the information noted on the label due to time constraints, as it is not feasible to record all the data manually,” Franziska said.
Many of the labels in the collection contain both handwritten and printed lines. © MfN Berlin
Joining forces with Transkribus
The museum had previously experimented with Transkribus for transcribing historical documents. Franziska explained: "During Covid-19, we started initial trials to make digitised archive materials machine-readable with Transkribus. Our transcription workshop also used Transkribus: citizen scientists supported research projects by transcribing documents written in old German handwriting." These early interactions laid the groundwork for further experimentation, but the team had concerns about whether the platform could handle the specific requirements of label transcription. "Early tests showed that layout recognition, readability, technical language, and data processing were really challenging."
Because of this experience, the team believed that manual transcription was their only option, until they came across Transkribus again on a different project. "We were looking for a partner to create a Ground Truth dataset for a quarter of a million label photos from entomology," Franziska explained. "At the time, we assumed that this could only be achieved through manual transcription. However, the concept READ-COOP presented to us convinced us to try handwritten text recognition and named entity recognition." And so, the Berlin museum awarded the Austrian cooperative the tender to digitise and transcribe the 250,000 labels.
Many of the labels use outdated terms, such as "Formosa", a previous name for Taiwan. © MfN Berlin
The first project hurdle
While you might not notice it when using the platform, processing a document in Transkribus actually involves two separate steps: the text-lines on the page are first detected, and then the text within each line is transcribed. At the start of the MfN project, this conventional sequential workflow was used to process the images, but it struggled to produce workable transcriptions of the labels. "The sequential approach using the models implemented in the Transkribus platform did not result in the quality we were hoping for," said Franziska.
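To make those two steps concrete, here is a minimal sketch in Python of how such a sequential pipeline works in principle. The interfaces used here (detect_lines, transcribe, crop) are hypothetical placeholders for illustration only, not the actual Transkribus API.

```python
# Illustrative sketch of a two-step, sequential pipeline. All interfaces
# (detect_lines, transcribe, crop) are hypothetical placeholders, not the
# real Transkribus API.
from dataclasses import dataclass

@dataclass
class TextLine:
    box: tuple          # (x, y, width, height) of the line on the label image
    orientation: str    # "horizontal" or "vertical"

def sequential_transcribe(image, layout_model, recognition_model):
    """Step 1: detect text-lines; step 2: transcribe each cropped line."""
    lines = layout_model.detect_lines(image)
    transcripts = []
    for line in lines:
        crop = image.crop(line.box)
        # The recogniser only sees the isolated crop, so any positional
        # context (such as "this was a vertical line") is lost at this point.
        transcripts.append(recognition_model.transcribe(crop))
    return transcripts
```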
READ-COOP proposed two solutions to improve the quality. Firstly, they asked the MfN to provide more Ground Truth transcriptions of handwritten labels, as these were the labels the models struggled with most. "Our collection staff sat down and manually transcribed 5,000 handwritten labels," Franziska recalled. "Only taxonomically trained collection staff with excellent knowledge of the MfN insect collection could provide a gold standard, which limited the people available for this job to just a handful."
The end-to-end model was capable of producing accurate transcriptions, even with a mix of vertical and horizontal lines. © MfN Berlin via Transkribus
Creating a new kind of model
The second solution was the development of an end-to-end document processing pipeline: a first for READ-COOP. Instead of processing documents in two steps, READ-COOP's R&D team developed a new workflow that combines text-line recognition and transcription in a single end-to-end model. This is not only more efficient, but also more accurate, because the model can draw on information about what kind of text appears where on the page.
With the previous sequential workflow, a text-line recognition model would first detect where each horizontal or vertical text-line sits on the page. Those text-lines would then be extracted and fed to the transcription model. However, the transcription model would have no context about where on the page a text-line came from, making it harder to accurately predict the text it contains.
But because the end-to-end model performs both the text-line recognition and the transcription, it has more knowledge about the context of each text-line. It might learn, for example, that vertical text-lines are often dates, or that names are often written in the bottom corner of the page. This visual information narrows down the options for the model when predicting the text, helping it to produce more accurate transcriptions.
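As a rough illustration of the difference, an end-to-end interface might look something like the sketch below. Again, the names are hypothetical and this is not READ-COOP's actual implementation; the point is simply that layout and text are predicted together, so positional context is available when the text is predicted.

```python
# Illustrative sketch of an end-to-end interface: one model, one pass,
# layout and transcription predicted together. Hypothetical names only.
from dataclasses import dataclass

@dataclass
class RecognisedLine:
    box: tuple          # where the line sits on the label image
    orientation: str    # e.g. vertical lines often turn out to be dates
    text: str           # transcription predicted jointly with the layout

def end_to_end_transcribe(image, e2e_model):
    # A single forward pass returns positions and text together, so the model
    # can exploit visual context such as where on the label a line appears.
    return e2e_model.predict(image)   # -> list of RecognisedLine
```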
By combining more high-quality training data with the more sophisticated end-to-end model, the team were able to dramatically improve the accuracy of the transcriptions. "The [end-to-end] model processed [the 5,000 extra transcripts of training data] for several days, resulting in a noticeable increase in recognition quality."
GeoNames is an open-access database containing geographical entities from around the world. © GeoNames
Implementing named entity recognition
Of course, creating a digital version of an archival collection involves more than just transcribing the words. The labels in the MfN's collection often feature the same type of information (for example, the species name) in roughly the same position on the label (for example, at the top). This made them well suited to named entity recognition (NER), a process in which the software automatically detects and tags certain entities in the documents, such as the name of a species or the collection date. Once identified, these entities can be extracted and compiled into a structured data format, such as a spreadsheet or an XML file.
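As a simple illustration of that last step, tagged entities can be compiled into a spreadsheet-style file along these lines. The label text and entity values below are invented for the example, not records from the MfN dataset.

```python
# Minimal sketch: compiling NER-tagged label transcriptions into a CSV file.
# The transcription and entity values below are invented for illustration.
import csv

tagged_labels = [
    {
        "transcription": "Formosa, 12.VI.1911, leg. A. Müller",
        "entities": {"location": "Formosa",
                     "date": "12.VI.1911",
                     "collector": "A. Müller"},
    },
]

with open("labels.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["transcription", "location", "date", "collector"])
    writer.writeheader()
    for label in tagged_labels:
        writer.writerow({"transcription": label["transcription"], **label["entities"]})
```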
One of the key entities in this collection was the location where each specimen was collected. Instead of spending resources tagging these locations manually, READ-COOP trained an NER model to recognise which words on a label refer to the collection location. This allowed the locations across the entire collection to be automatically recognised and tagged.
These location tags were then enriched by linking the recognised place names to a unique reference ID in the GeoNames database. As READ-COOP's R&D Lead, Michael Ustazewski, explained: "GeoNames is an open-access database that collects real-world geographic entities by assigning them a unique ID and collecting information about each one, such as the coordinates and alternative names. Linking location entities to GeoNames IDs makes it easier to conduct downstream analyses and interpretations of the raw data."
"Additionally, these unique IDs are important for identifying real-world identities among ambiguous place names. For instance, there is the city of "London" in the United Kingdom, but there are towns called "London" in the USA, Canada, and many other English-speaking countries, too. Despite having the same name, each of them refers to a different real-world entity. By enriching the location entity with the GeoNames ID, we are able to see which location is meant on which label.”
End-to-end models produce better transcriptions, even with diverse collections. © MfN Berlin via Transkribus
A successful collaboration
After several months of hard work from the teams at the MfN and READ-COOP, the final dataset was complete: 250,000 specimen labels, transcribed and tagged with named entities. This dataset can now be used in further digital projects by the MfN, giving researchers better access to the museum's entomological collection.
And it wasn't just the outcome that was successful, but the shared journey that led to it. “The collaboration was very enjoyable, and we learned a lot about the current possibilities and limitations of AI, as well as about preparing digitisation data for such projects," Franziska said. "It was particularly nice that our partners at READ-COOP approached the project with the same curiosity and enthusiasm for discovery as our scientists.”
For READ-COOP, the project marked the first time an end-to-end model had been successfully implemented with real-world data, a big achievement. There are already plans to bring this kind of model to Transkribus in the future, something the MfN would also be happy to see: "We are also delighted that some of the features tested during our collaboration might be made available to all users as part of the Transkribus platform in the future," Franziska told us.
In conclusion
This project illustrates the potential of end-to-end models to improve the efficiency of large-scale digitisation projects. However, it also shows that even the most sophisticated technologies depend on high-quality training data and the knowledge of experts in the field. It is the combination of all these elements that makes for the seamless digitisation of heritage artefacts.
We would like to take this opportunity to thank the MfN for collaborating with us on this groundbreaking project, and especially Franziska for taking the time to write this blog post with us.
Interested in learning more about how Transkribus can support digitisation and transcription at your heritage institution? Contact us today and start the conversation.