
Is it worth training a model on Transkribus?

Fiona Park

When you start a project in Transkribus, one of the key decisions you have to make is whether you want to use a public model to transcribe your documents, or invest time training a custom model for your project.

There is no “one-size-fits-all” approach here—while one collection might be best served with a public model, another collection might require a custom model in order to achieve accurate transcriptions. In this blog post, we will look at all the different factors that go into making this decision, to help you decide which method is right for your collection.

 

Key Takeaways

  • Always test public models first: Before you do any training, always test a powerful, general model like "Text Titan." It’s fast, requires no effort, and may be perfectly sufficient for your needs.
  • Use custom models for unique material: If your documents feature difficult or unique handwriting, specialised terminology, or complex layouts (like tables and forms), a general public model will likely fail to produce accurate results.
  • Custom models are more efficient for large collections: For large-scale projects, the time invested in training a model is nearly always repaid, as a custom model drastically reduces the time needed for manual correction.


The Transkribus Public AI Model Hub allows you to search the entire catalogue of public models and find one that is right for your documents. © Transkribus

 

The first step: Always try a public model

There is a golden rule when starting any new project in Transkribus: Always try a public model first. This step is crucial as it provides an immediate, low-effort baseline for your transcription accuracy. Transkribus offers a wide variety of public models, but the most powerful and versatile is the "Text Titan". This general-purpose model was trained on an enormous and diverse dataset, enabling it to produce surprisingly accurate transcriptions straight out of the box. 

If you achieve good results with the Text Titan, you have found the best of both worlds: highly accurate transcriptions produced almost instantly, with no need to create your own training data. If the Text Titan does not provide the accuracy you need, you should also explore other specialised public models that have been trained by the Transkribus community. 

Each public model comes with a description, and for many you can also view the training data it was trained on, to judge how close it is to your documents. The Quick Text Recognition feature at the bottom of the page lets you test a few of your own documents and see how well the model performs. If, after these tests, the accuracy is still low and requires extensive manual correction, then it is time to think about training your own model.

 

The low-resource language and complex layout mean that this Tibetan manuscript would benefit from custom model training. © Wellcome Collection via Wikimedia Commons

 

Collection types that work well with custom models

While generic public models work well with many collections, some types of material benefit particularly from a custom model. If your collection meets any of the criteria below, then it is worth considering training one.

  • Unique or difficult handwriting: This is perhaps the most common reason. A general model is trained on thousands of different hands and learns an average of them all, so it has not been trained to recognise the specific idiosyncrasies of your material. If your collection consists of documents from a single scribe, a personal diary, or a lifetime of correspondence from one individual, a custom model can learn that person's unique way of forming letters and abbreviations.
  • Specialised content: Models do not just learn the shape of letters; they also learn language patterns and vocabulary. Public models are trained on general texts and will falter when faced with highly specialised content. If your documents are full of technical jargon, obscure terminology, non-standard spellings, or a complex system of abbreviations—such as in 17th-century alchemical texts or 19th-century medical ledgers—a custom model can learn this specific vocabulary, dramatically improving accuracy.
  • A specific language or dialect: While major world languages are well-represented in public models, many projects focus on materials that are not. This includes low-resource languages, historical dialects with non-standard orthography, or documents that feature multiple languages. A custom model trained on your specific material is often the only way to achieve high accuracy.
  • Complex layouts: Documents are often more than just a single block of text. Archives and libraries are full of materials with complex layouts: tables, forms, parish registers, newspapers with multiple columns, or manuscripts with marginalia. A standard transcription model will attempt to read this text line by line, resulting in a jumble of data. Transkribus allows you to train specific "Field Models" or "Table Models" to recognise the structure of the page and correctly extract data from specific columns, fields, or regions, so that the final transcription is both accurate and correctly structured.

 

Custom models are usually worth training for large collections, such as this collection at the Archives nationales in France. © Chris93 via Wikimedia Commons

 

Custom models are more efficient for large collections

Even if your collection does not fit neatly into one of the categories above, training a custom model may still be the most efficient choice, particularly for large-scale projects. This decision comes down to a simple cost-benefit analysis of your time.

Consider an archive with 20,000 pages of city council minutes from the same period. A public model might yield a 15% Character Error Rate (CER). That may sound acceptable, but it means that roughly one in every seven characters is wrong, requiring a full human review and correction of every single page and taking months to complete.
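
To make that figure concrete: CER is simply the character-level edit distance between the model's output and a corrected reference text, divided by the length of the reference. The minimal Python sketch below shows the calculation; the two sample sentences are invented purely for illustration and are not taken from any real Transkribus output.

```python
# Minimal sketch of a Character Error Rate (CER) calculation:
# edit distance between model output and a corrected reference,
# divided by the length of the reference.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return levenshtein(hypothesis, reference) / len(reference)

reference  = "The council met on the third of May"
hypothesis = "The counsil mct on the thrid of Mey"   # invented model output
print(f"CER: {cer(hypothesis, reference):.1%}")      # ~14%, i.e. about 1 character in 7
```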

Now, consider the custom model workflow. You might spend 50 to 100 hours manually transcribing 100 pages to create a high-quality Ground Truth dataset. This is the "cost". However, once you train a model on this data, it may be able to transcribe the remaining 19,900 pages at a 3% CER. This level of accuracy is so high that the text may be usable for many purposes with only minimal spot-checking. The initial time investment is saved many times over. For large, homogeneous collections, or for projects requiring highly accurate data for digital editions or computational research, a custom model is the most reliable and efficient path.
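
The arithmetic behind this claim is straightforward. The sketch below runs the numbers for the 20,000-page example; the correction speeds and training hours are illustrative assumptions, not measured Transkribus figures, so substitute estimates from your own test pages.

```python
# Back-of-the-envelope comparison of the two workflows for the
# 20,000-page example. All rates are illustrative assumptions.

TOTAL_PAGES = 20_000
GROUND_TRUTH_PAGES = 100
GROUND_TRUTH_HOURS = 75            # assumed: ~45 minutes per manually transcribed page

# Assumed correction speeds: more errors mean slower proofreading.
PAGES_PER_HOUR_AT_15_CER = 4       # full correction of every line
PAGES_PER_HOUR_AT_3_CER = 20       # light spot-checking

# Workflow 1: public model at ~15% CER, correct every page by hand.
public_hours = TOTAL_PAGES / PAGES_PER_HOUR_AT_15_CER

# Workflow 2: transcribe 100 pages of training data, then correct the rest at ~3% CER.
custom_hours = GROUND_TRUTH_HOURS + (TOTAL_PAGES - GROUND_TRUTH_PAGES) / PAGES_PER_HOUR_AT_3_CER

print(f"Public model only: ~{public_hours:,.0f} hours of correction")
print(f"Custom model:      ~{custom_hours:,.0f} hours including training data")
# Under these assumptions, the custom model cuts total effort by roughly 80%.
```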

 

Conclusion

In short, the journey for every project should begin with the simple, fast, and powerful public models like the "Text Titan". For many, this will be all that is needed. However, if your collection is defined by its uniqueness, or if a public model is simply not producing accurate results, then it is worth investing the time to train a custom model. For large-scale institutional projects, this investment is consistently repaid in long-term efficiency and the high-accuracy automation it provides. Whichever method you choose, Transkribus gives you the tools you need to unlock your collection.

 

Want to find out more about transcribing historical documents with Transkribus? Check out our website, our Help Center, or our YouTube channel for more information about our innovative, cooperative-run platform.

 
