How to digitise newspapers with Transkribus
Historical newspapers are valuable resources for anyone looking to understand the past. They offer a day-by-day account of life, capturing everything from major political shifts to local advertisements and community news. For libraries, universities, and archives, these collections are some of the most requested items, yet they are also some of the most difficult to research. By digitising these archives and transcribing the text, you can move away from manual page-turning and allow researchers to search for specific names, dates, or keywords across thousands of pages in seconds.
While the uniform typefaces used in newspapers are ideal for AI transcription, the complex, multi-column layouts typical of old newsprint can make digitisation a challenge. In this post, we break down how to use Transkribus to turn those stacks of paper into useful digital data, from the initial scan to the technical settings needed to get the layout recognition right.

The complex layout of newspapers can make them difficult to transcribe with AI. © Bjankuloski06 via Wikimedia Commons
Why newspapers are challenging for AI to read
Most documents, like books or letters, are read from top to bottom in a single block. But newspapers have a much more chaotic structure. They use complex grids where stories start in one column and skip over to another, or are interrupted by images, horizontal dividers, and large-print advertisements. If you run a standard "out of the box" transcription, the AI might try to read straight across the page, merging three different articles into a single, nonsensical sentence.
To get a clean transcript, you have to help the AI understand the geometry of the page first. This involves a specific three-part sequence: first, detecting the large "regions" (the boxes that hold articles); second, finding the "baselines" (the invisible lines the text sits on); and finally, running the actual text recognition. Without this structured approach, the reading order will be broken, making the data useless for serious research.
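To see why the region step matters, here is a minimal Python sketch comparing a naive row-by-row reading order with a column-aware one. It illustrates the idea only and is not how Transkribus works internally: the `Region` class, the pixel coordinates, and the `column_tolerance` parameter are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    x: int  # left edge of the region, in pixels (hypothetical values)
    y: int  # top edge of the region, in pixels (hypothetical values)


def naive_reading_order(regions):
    """Read straight across the page: top to bottom, then left to right."""
    return sorted(regions, key=lambda r: (r.y, r.x))


def column_reading_order(regions, column_tolerance=50):
    """Group regions into columns by their left edge, then read each column top to bottom."""
    columns = []
    for region in sorted(regions, key=lambda r: r.x):
        # A region joins the current column if its left edge is close to that column's first region.
        if columns and abs(columns[-1][0].x - region.x) <= column_tolerance:
            columns[-1].append(region)
        else:
            columns.append([region])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda r: r.y))
    return ordered


# Two columns, each holding the two halves of one story.
page = [
    Region("story A, part 2", x=0, y=500),
    Region("story B, part 1", x=400, y=0),
    Region("story A, part 1", x=0, y=0),
    Region("story B, part 2", x=400, y=500),
]

print([r.name for r in naive_reading_order(page)])
print([r.name for r in column_reading_order(page)])
```

The naive order interleaves the two stories, while the column-aware order keeps each story together. That is the behaviour the regions-first workflow is meant to secure.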

The first step in any digitisation project is to create high-quality scans of the source material. © Transkribus
Step 1: Scanning your collection
The foundation of a good digital archive is the quality of the raw image. Newspaper pages are often large, thin, and yellowed, which can make them tricky to photograph without shadows or "bleed-through" from the other side.
- Use the ScanTent: If you have fragile items or a smaller budget, the ScanTent is a great tool. It provides a stable platform for a smartphone or camera, ensuring the lens is perfectly parallel to the page. This prevents "keystoning," where the text at the bottom of the page looks smaller than the text at the top, which can confuse the AI.
- Check the resolution and lighting: Aim for at least 300 dpi. High contrast is recommended, as you want the black ink to stand out sharply against the paper. If the paper is very thin, putting a sheet of black paper behind the page you are scanning can prevent text on the reverse side from showing through.
- Upscaling for small print: Historical newspapers often used tiny fonts to save space. If the letters look like dots when you zoom in, try doubling the image size (interpolation) before uploading to Transkribus. This gives the layout engine more "surface area" to detect the fine loops and tails of the characters.
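If you photograph pages rather than scan them, the 300 dpi target is easy to check with some arithmetic: the effective resolution is the image width in pixels divided by the page width in inches. The short sketch below shows the calculation. The function names and the 40 cm page width are assumptions chosen for illustration.

```python
import math

CM_PER_INCH = 2.54


def effective_dpi(pixels_wide: int, page_width_cm: float) -> float:
    """Effective resolution of a photographed page, in dots per inch."""
    return pixels_wide / (page_width_cm / CM_PER_INCH)


def min_pixels_for_dpi(page_width_cm: float, target_dpi: float = 300.0) -> int:
    """Smallest image width in pixels that reaches the target resolution."""
    return math.ceil(page_width_cm / CM_PER_INCH * target_dpi)


# A hypothetical 40 cm wide page captured as a 3000 pixel wide image.
print(round(effective_dpi(3000, 40.0)))  # roughly 190 dpi, below the 300 dpi target
print(min_pixels_for_dpi(40.0))          # pixel width needed to reach 300 dpi
```

In this example a 3000 pixel wide image falls short of the target, so the page would need a camera with more pixels across its width, or it could be photographed in two halves.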
To prepare Training Data for a Field Model, you need to mark the different layout elements in your documents. © Transkribus
Step 2: Using Field Models for segmentation
A Field Model is a sophisticated type of model in Transkribus that focuses on semantic layout analysis. Instead of just seeing "a box of text," it can be trained to understand "this is a headline" or "this is a footer", making it the most effective way to handle the vertical lines and columns found in newspapers.
- Train for specific structures: You can use Field Models to identify different parts of a page like "Article," "Heading," "Advertisement," or "Image." By categorising these first, you can later tell Transkribus to only transcribe the "Article" regions and ignore the ads, saving time and processing power.
- Handling the columns: The Field Model acts as a barrier. It draws distinct boundaries around columns so that the line-detection tool knows exactly where to stop. This ensures that the reading order stays within the column from top to bottom before moving to the next one.
- Removing "noise": Many archival documents contain margin notes, library stamps, or printer marks. You can train a Field Model to label these as "ignored," so they don't get mixed into your final text search results.
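The "categorise first, transcribe selectively" idea can be pictured as a simple filter over labelled regions. The sketch below is purely conceptual: `LabelledRegion`, `select_for_transcription`, and the region IDs are hypothetical names for this example and are not part of any Transkribus API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabelledRegion:
    region_id: str  # hypothetical identifier
    label: str      # label assigned during layout analysis, e.g. "Article"


def select_for_transcription(regions, keep_labels=("Article",)):
    """Keep only the regions whose label is worth sending to text recognition."""
    return [r for r in regions if r.label in keep_labels]


page = [
    LabelledRegion("r1", "Heading"),
    LabelledRegion("r2", "Article"),
    LabelledRegion("r3", "Advertisement"),
    LabelledRegion("r4", "Image"),
    LabelledRegion("r5", "Article"),
    LabelledRegion("r6", "ignored"),
]

print(Counter(r.label for r in page))
print([r.region_id for r in select_for_transcription(page)])
print([r.region_id for r in select_for_transcription(page, keep_labels=("Article", "Heading"))])
```

Here only two of the six regions would be transcribed by default, which is where the saving in time and processing comes from.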
Configuring the correct settings is key to successful layout analysis with newspapers. © Transkribus
Step 3: Locating the baselines
Once your Field Model has identified the regions, you need to find the baselines—the horizontal lines of text on the page. This is a critical step because the text recognition AI follows these lines like a train on a track. When using Field Models with newspapers, you should use the Advanced Settings in the Layout Recognition menu to fine-tune the baseline detection.
- Select the right model: In the "Recognise" tab, choose "Layout." For newspapers, the "Mixed Line Orientation" models are usually best because they are trained to handle text that might be slightly slanted or packed tightly together.
- Preserve your regions: This is the most important setting. Set "Generation of Text Regions" to "Keep existing". This tells Transkribus to use the column boxes you already created with your Field Model rather than trying to draw new ones from scratch.
- Fine-tune the line length: Set the "Minimal Baseline Length" to a low value (e.g., 10 or 20). Because newspaper columns are narrow, some lines—like a single word at the end of a paragraph—are very short. If this setting is too high, the AI will ignore those short lines.
- Prevent column jumping: Always set "Split Lines on Regions border" to "Yes." This acts as a digital pair of scissors, ensuring that no baseline accidentally crosses over the gutter between two columns.
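To picture what the last two settings do, here is a small geometric sketch in Python. It is an illustration of the concept under simplifying assumptions, not the actual Transkribus algorithm: baselines and columns are reduced to horizontal extents, the unit is assumed to be pixels, and `split_baseline` is a hypothetical helper.

```python
def split_baseline(baseline, columns, min_length=20):
    """Clip a baseline to each column and drop pieces shorter than min_length.

    baseline: (x_start, x_end) horizontal extent of a detected line
    columns:  list of (left, right) horizontal extents of the column regions
    """
    start, end = baseline
    segments = []
    for left, right in columns:
        seg_start = max(start, left)
        seg_end = min(end, right)
        # Pieces with no overlap have a negative length and are dropped here too.
        if seg_end - seg_start >= min_length:
            segments.append((seg_start, seg_end))
    return segments


# Two columns separated by a 20 pixel gutter.
columns = [(0, 400), (420, 820)]

# A baseline that wrongly runs across the gutter is cut into one piece per column.
print(split_baseline((10, 800), columns))

# A very short last line of a paragraph: lost with a high minimum, kept with a low one.
print(split_baseline((430, 445), columns, min_length=20))
print(split_baseline((430, 445), columns, min_length=10))
```

The first call shows the "scissors" effect at the column border. The last two show why a low minimum length matters for narrow columns: the 15 pixel line only survives when the threshold is below its length.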

There are hundreds of models to choose from when transcribing the text in your newspapers. © Transkribus
Step 4: Training and transcription
With the layout fixed and the lines drawn, you are ready to turn those images into text. To do so, you will need to choose a Transkribus model for the text recognition, which can either be a pre-trained public model, or one that you have trained yourself.
- Use a public model: Before deciding to train your own model, check the "Public Models" hub. There are powerful models for most major languages, such as English, German (including Fraktur/Gothic scripts), and Dutch, that have been trained on millions of words. Often, these work at 95% accuracy or higher with printed text such as newspapers.
- Train your own model: If your newspaper uses a very rare font, you might need to create a custom model. You will need to transcribe about 50 pages manually to create Ground Truth. Once the AI learns the specific shapes of your newspaper’s letters, the error rate will drop significantly.
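When comparing models, it helps to know how an accuracy figure can be measured. A common approach is the character error rate (CER): the number of single-character edits needed to turn the model's output into the Ground Truth, divided by the length of the Ground Truth. The sketch below assumes that "95% accuracy" corresponds to a CER of 5%, and the sample sentence is invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, hypothesis) / len(ground_truth)


ground_truth = "The mayor opened the bridge"
model_output = "The rnayor opened the bridqe"  # typical confusions: m/rn and g/q

cer = character_error_rate(ground_truth, model_output)
print(f"CER: {cer:.1%}")
```

Running the same comparison on a few manually transcribed pages, before and after training, gives a concrete way to judge whether a custom model is worth the effort.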
Where can I find out more?
Newspapers are one of the trickier resources to digitise but well worth the effort. Once processed, your archive transforms from a collection of images into a data-rich resource where centuries of history can be searched in an instant.
Want to learn more about digitising historical documents with Transkribus?
- Check out this article on our Help Center about transcribing newspapers with Transkribus
- Watch this webinar on our YouTube channel for more information about using Field Models
