
How to choose the best Transkribus model for your documents

One of the most powerful features of Transkribus is its library of over 300 public AI text recognition models, which are available to every user. These models give Transkribus the knowledge it needs to transcribe a huge range of historical material, from Spanish Gothic print to handwritten Church Slavonic.
Finding the right model for your documents is key to producing accurate transcriptions. But with so many models on offer, how do you choose the right one? In this post, we'll give you helpful hints for choosing the model that will provide the most accurate transcriptions of your documents.
Key takeaways
- Know your document: Before you start, identify your document's key features: language, script, time period, and whether it's handwritten or printed.
- Filter your search: Use the search and filter options on the Transkribus Public AI Model Hub to narrow down the hundreds of models to a short list of relevant candidates.
- Check the stats: Pay attention to the Character Error Rate (CER). A lower CER often indicates a more accurate model.
- Always test first: Use the "Quick Text Recognition" feature to test your top model choices on a sample page from your collection before running a large batch.
- When in doubt, train your own: If you can't find a good public model, training your own custom model is the most reliable way to achieve high accuracy.
The Transkribus platform enables you to produce high-quality transcriptions of historical documents, both printed and handwritten. © Marjory Fleming's Diary via Transkribus
What is a text recognition model?
But before we start, let's take a quick look at what we mean by "model". The easiest way to think of an AI text recognition model is as a specialised instruction manual. It’s a set of rules and patterns that tells the Transkribus software how to read and transcribe a particular kind of text.
A "public" model is one that can be used by any Transkribus user. These are created either by the Transkribus team or by other users who have generously made their models public for everyone to share and benefit from.
Some Transkribus models are very generic. A powerful model like the Text Titan, for example, is designed to be a jack-of-all-trades, telling Transkribus how to transcribe a wide range of material in different languages and from different centuries. Other models are extremely specific and tailored to a particular type of document. For instance, you can find a model trained exclusively on the handwriting of a single person, like this one for a 17th-century Quebec notary.
Choosing the right "manual" for the job can make a huge difference in how accurate your transcriptions are. For example, if you use a model designed for 19th-century Greek to transcribe 18th-century Danish newspapers, you will likely get a transcription that is complete nonsense. But if you select a specific model for 18th-century Danish newspapers, you are giving Transkribus the knowledge it needs to correctly read your documents. This results in far more accurate transcriptions, saving you a lot of time on corrections.
Transkribus offers hundreds of different models, the majority of which were created by the Transkribus community. © Transkribus
Choosing a model on Transkribus
You can select the right model for your documents in just four easy steps.
Step 1: Get to know your documents
The first step is to take a good look at the documents you want to transcribe. You need to identify a few key features:
- Type of material: Is your document handwritten or printed? This is the most important distinction, as models are designed for one or the other.
- Language: What is the language of the text? Transkribus has models for many languages, so make sure you pick the right one.
- Period: When was the document written? Handwriting and print styles can change a lot over time, so choosing a model trained on documents from the same era is important.
- Type of script: Some languages, such as Turkish, have used several scripts throughout history. Therefore, it is important to make sure your model is trained for the correct script.
An example of searching for a model for documents in English secretary hand. © Transkribus
Step 2: Use the search and filter options
Once you know the details of your document, you can use the search and filter tools on the Transkribus Public AI Model Hub to find a good match. Here’s how to use them:
- Select model type: Transkribus offers public models for various functions, such as table recognition. But for transcription, you need to select a text model.
- Select language: Choose the language of your documents. If your documents are in multiple languages, you might want to consider using a Super Model, such as the Text Titan.
- Filter by text type and century: By clicking the Filter button, you can filter for either handwritten or printed text models, as well as by century. Keep in mind that some models are not tagged with a specific century. Filtering by century will hide these from your search results, so if you don't find a good match at first, try your search again without a century filter.
- Search by keyword: If you want to find a certain type of model, you can also add keywords in the Search Models text box. This is particularly useful for niche models or scripts, where there may only be a few models that match.
On the overview chart, you can see the language, CER, and other vital details about each model. © Transkribus
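If it helps to see the filtering logic spelled out, here is a minimal sketch in Python. It does not call Transkribus in any way: the `ModelInfo` record, the `shortlist` function, and every model name and CER value in the sample data are invented for illustration. It only shows the idea behind the filters, including why models without a century tag deserve a second look.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ModelInfo:
    """Hypothetical record of the details shown for each model."""
    name: str
    language: str
    text_type: str               # "handwritten" or "printed"
    centuries: Tuple[int, ...]   # empty tuple = no century tag
    cer: float                   # Character Error Rate in percent


def shortlist(models: List[ModelInfo], language: str, text_type: str,
              century: Optional[int] = None,
              include_untagged: bool = True) -> List[ModelInfo]:
    """Return matching models, lowest CER first."""
    matches = []
    for m in models:
        if m.language != language or m.text_type != text_type:
            continue
        if century is not None:
            if m.centuries:
                # Tagged model: keep it only if the century matches.
                if century not in m.centuries:
                    continue
            elif not include_untagged:
                # Untagged model: dropped only when asked to be strict.
                continue
        matches.append(m)
    return sorted(matches, key=lambda m: m.cer)


# Invented sample data, not real models from the hub.
SAMPLE_MODELS = [
    ModelInfo("Hypothetical Danish Newspapers", "Danish", "printed", (18,), 1.2),
    ModelInfo("Hypothetical Danish Print (untagged)", "Danish", "printed", (), 2.5),
    ModelInfo("Hypothetical Danish Letters", "Danish", "handwritten", (18, 19), 4.0),
    ModelInfo("Hypothetical Greek Print", "Greek", "printed", (19,), 1.8),
    ModelInfo("Hypothetical Danish Print 19th", "Danish", "printed", (19,), 0.9),
]

for model in shortlist(SAMPLE_MODELS, "Danish", "printed", century=18):
    print(f"{model.name}: CER {model.cer}%")
```

Setting `include_untagged=False` in this sketch mimics a strict century filter, which would drop the untagged Danish print model from the short list. That is the situation in which repeating the search without a century filter can surface extra candidates.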
Step 3: Check the model's stats
After you've filtered the models, you'll have a list of potential options. To help you decide, you should check the statistics for each one. Click on "View Model" to see more details:
- Character Error Rate (CER): The CER is a key number that shows the percentage of characters the model transcribed incorrectly during testing. A lower CER usually means better performance. Remember that this score was measured on material like that used to train the model, not on your documents, so it's a good guide but not a guarantee of how it will perform on your specific material.
- Number of words, lines, and pages: These numbers tell you how much material the model was trained on. A model trained on a large and varied dataset is often more flexible and accurate than a model with a smaller training set.
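To make the CER more concrete, here is a small Python sketch of how a character error rate is commonly calculated: count the fewest single-character insertions, deletions, and substitutions needed to turn the model's output into the correct text, then divide by the length of the correct text. This is the usual textbook definition and we are assuming it here for illustration; the exact calculation behind the figures on the model pages may differ in details. The function names and the sample line are made up.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Fewest single-character edits needed to turn hypothesis into reference."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from the output
                curr[j - 1] + 1,     # extra character in the output
                prev[j - 1] + cost,  # match, or one wrong character
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# A hand-corrected line and two invented model outputs for the same line.
correct_line = "the quick brown fox"
output_a = "the quiek brown fax"   # two wrong characters
output_b = "the quick brwn fox"    # one missing character

for label, output in (("Model A", output_a), ("Model B", output_b)):
    print(f"{label}: CER {cer(correct_line, output):.1%}")
```

In this toy example the line has 19 characters, so two errors give a CER of about 10.5% and one error gives about 5.3%. The same comparison works on a page of your own: if you have a carefully corrected transcription, you can score each candidate model's output against it.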
Step 4: Give the model a test run
The final and most important step is to test the model on a sample of your own documents. At the bottom of each model page, there is a "Quick Text Recognition" feature that lets you transcribe one page of your collection with the model you’ve chosen. This is the best way to see how it will really perform and get a clear idea of whether the model is a good fit for the rest of your material.
The Quick Text Recognition feature allows you to quickly try out each model with your documents. © Transkribus
Struggling to find a good model?
What happens if you've searched through the entire library and still can't find a model that works well with your documents? While Transkribus' public models cover a very wide range of material, sometimes there just isn't one that is a good fit. This is particularly true if you are working with documents in a rare language or a very unusual script.
Fortunately, Transkribus also allows users to train their own text recognition models. This powerful feature lets you create a model that is perfectly tailored to your specific documents. By training a model on your own material, you can teach the AI exactly how to read the unique handwriting or print style you're working with. In most cases, this is the most reliable way of producing highly accurate transcriptions.
Want to train your own custom model? You can learn everything you need to know in our detailed guide on the Help Center.
In conclusion
Choosing the right Transkribus model is one of the most important steps towards accurate transcriptions. By taking a little time to analyse your documents, use the filters, check the stats, and test the model on your own material, you can find a good fit for your project. Once you have the right model selected, you'll be on your way to a successful transcription project.