Can AI read the Arabic script?

The Arabic script is one of the most widely used writing systems in the world. For over 1,500 years, it has been used to write not only the Arabic language but also a host of others across Asia, Africa, and Europe, including Persian, Ottoman Turkish, Urdu, Pashto, and Malay. Its cursive nature, where letters change form depending on their position in a word, makes it beautifully complex but also a significant challenge for automatic text recognition.
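To make the positional behaviour concrete, here is a minimal Python sketch using only the standard library. It looks at the letter beh (U+0628), which Unicode also encodes as four separate "presentation forms": isolated, final, initial, and medial. The dictionary and the helper name `describe_forms` are illustrative choices and not part of any Transkribus tooling.

```python
import unicodedata

# Unicode presentation forms of the letter beh (base code point U+0628).
# A recogniser sees four different shapes on the page for this one letter.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def describe_forms(forms):
    """Return (position, Unicode name, normalised base letter) for each form."""
    rows = []
    for position, glyph in forms.items():
        # NFKC normalisation folds a presentation form back to its base letter.
        base = unicodedata.normalize("NFKC", glyph)
        rows.append((position, unicodedata.name(glyph), base))
    return rows


for position, name, base in describe_forms(BEH_FORMS):
    print(f"{position:<9} {name:<36} -> U+{ord(base):04X}")
```

All four shapes normalise to the same letter, which is exactly the mapping a text recognition model has to learn from images, with ligatures and handwriting variation on top.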
For historians, archivists, and researchers working with documents written in the Arabic script, this complexity can make transcription a time-consuming and difficult process, even with the help of AI. Transkribus is already a tried-and-trusted solution for documents in the Latin script, and a growing number of researchers and institutions are now choosing it to transcribe Arabic-script documents, contributing to our knowledge of using automatic text recognition with this beautiful script.
Here are three models, created by the Transkribus community, that can help you unlock your Arabic documents with AI.
Reading the Arabic script with AI:
- AI text recognition platforms can read and transcribe historical documents in the Arabic script. For example, software like Transkribus uses specialised AI models to recognise different forms of the script, allowing users to convert images of historical documents into digital, searchable text.
- The three community models featured here cover 18th-century printed Arabic (Dabbas OCR), printed Ottoman Turkish (Ottoman Turkish Print), and a mix of printed and handwritten Ottoman Turkish (Ottoman Turkish Generic).
- These models can be selected when performing text recognition with Transkribus.
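As a rough aid for choosing between the three models, the coverage described in this article can be written down as a small lookup. This is a hypothetical sketch in Python: the model names come from the article, but the `ModelInfo` class and the `suggest_models` helper are illustrative assumptions and do not call or represent any Transkribus API. In practice you select the model in the Transkribus interface when starting text recognition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str             # model name as given in this article
    language: str         # language of the training material
    materials: frozenset  # subset of {"print", "handwritten"}


# Coverage as described in this article (not an official catalogue).
MODELS = [
    ModelInfo("Dabbas OCR", "Arabic", frozenset({"print"})),
    ModelInfo("Ottoman Turkish Print", "Ottoman Turkish", frozenset({"print"})),
    ModelInfo(
        "Ottoman Turkish Generic",
        "Ottoman Turkish",
        frozenset({"print", "handwritten"}),
    ),
]


def suggest_models(language, material):
    """Return names of models whose training data matches the document."""
    return [
        m.name
        for m in MODELS
        if m.language == language and material in m.materials
    ]


print(suggest_models("Ottoman Turkish", "handwritten"))
print(suggest_models("Ottoman Turkish", "print"))
```

An empty result, for instance for handwritten Arabic, simply means that none of the three models described here was trained on that kind of material.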
A page from the Book of the Noble, Pure Gospel and the Bright, Illuminating Lamp, printed in Aleppo in 1706. Available via license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Transcribing the first Arabic prints in the Middle East
The story of printing in the Arabic script is a fascinating one, and this model, named Dabbas OCR, is trained on materials from a pivotal moment in that history. The model is specialised for the unique typographic output of the Dabbās printing press, which was active in Aleppo between 1706 and 1711. This press was founded by Athanasius Dabbās, the Metropolitan of the Antiochian Church, and it represents the first successful attempt to print Arabic books with Arabic types within the Middle East.
The typefaces used in Dabbās’s books are particularly interesting because, while printed, they retain hybrid features that reflect older manuscript traditions. These include ligatures, overlined diacritics, and stylised letterforms, which can be challenging for standard OCR software. The model has been trained on high-resolution scans of the books printed at this press, which primarily consisted of liturgical texts from the Byzantine Orthodox tradition, such as the Four Gospels, the Psalter, and the Horologion. For researchers working with early Arabic printed materials, especially those with multilingual Greek-Arabic layouts or from a Christian context, this model offers a highly specialised starting point for transcription.
The Ottoman Turkish Print model was trained on a series of periodicals from the late Ottoman period. © Digital Ottoman Corpora
Unlocking 19th-century Ottoman Turkish print
Until the Turkish alphabet reform of 1928, Ottoman Turkish was written using a version of the Arabic script. Because of this, the vast administrative, literary, and journalistic output of the Ottoman Empire’s final century remains difficult for many researchers to access. The Ottoman Turkish Print model was designed to address this by focusing on printed materials from the late 19th and early 20th centuries.
This powerful model was created by the Digital Ottoman Corpora team, a group dedicated to making Ottoman Turkish texts more accessible for digital scholarship. They trained the model on a diverse selection of documents, including six different periodicals from the period and an Ottoman Turkish dictionary. This varied training data makes the model robust and capable of handling the different fonts and layouts found in the press of the era.
A key feature for users to note is that the model's output adheres to the “half-transcription” latinisation scheme. This standard, recommended by the Turkish Historical Association for late Ottoman print, is a specific system for transliterating the Arabic script into the Latin alphabet, which is vital for consistency in academic research. For any institution holding collections of Ottoman-era newspapers, journals, or books, this model can dramatically accelerate the process of creating searchable, machine-readable text.
→ Read more about the model in this Success Story
The "Mecmua" poetry collection was one of the handwritten materials used to train the Ottoman Turkish Generic model. AF222a: 6r. in: mecmua online
Creating a model for both handwritten and printed Ottoman Turkish
For researchers working with a variety of Ottoman Turkish documents, the Ottoman Turkish Generic model offers a versatile solution. Created by Milanka Matić-Chalkitis as part of the MultiHTR project, this model is uniquely trained on both handwritten and printed materials in Ottoman Turkish. This mixed training data makes it a flexible tool for collections that feature a range of document types.
The handwritten data used to train the model includes the poetry collection 'Mecmua', the collected poems of the Ottoman poet Keşfī, and a selection of travelogues and military correspondence. For the printed materials, the model was trained on various newspapers and journals from the late Ottoman period, generously provided by the Digital Ottoman Corpora team. This diverse training set, which also benefits from the 'data recycling' principle, gives the model a high degree of flexibility, allowing it to handle variations in layout, font, and physical quality.
The model is designed as an auxiliary tool, particularly helpful for users who may not be experts in the Ottoman Turkish language or the Arabic-Persian script. It provides a solid starting point for transcribing a wide array of documents, from personal letters and poetry to official publications.
How to use these models with Transkribus
Working with historical documents in the Arabic script, whether printed or handwritten, requires specialised tools. Using AI to automate the transcription process allows researchers, librarians, and archivists to save countless hours of manual work, freeing up valuable time and energy to focus on interpretation, analysis, and curation.
To learn more about how to use these models, or how to train your own for a specific collection, please visit our Help Centre at help.transkribus.org or watch the webinar below for an introduction to using public models in Transkribus.