
Unlocking Sakya texts: Creating a workflow for cataloguing Tibetan manuscripts

Fiona Park

Tibetan is one of the world’s major literary languages, with vast collections of philosophical, religious, historical, grammatical, and medical texts written in the language. It is also the language of the Sakya school of Tibetan Buddhism, a tradition that has played a central role in Tibetan intellectual and religious history. Preserving and studying Sakya manuscripts is a goal for many scholars, but like any manuscript work, it is one that comes with challenges.

To address these challenges, a group of academic researchers and scholars from the Sakya tradition recently gathered for a workshop entitled ‘Collection Unveiled: Exploring Newly Discovered Sakya Texts’. The event, hosted by Prof. Jörg Heimbel from the Institut für Indologie und Tibetologie at LMU Munich and supported by Transkribus, aimed to process a new collection of these centuries-old manuscripts by collectively developing an effective workflow for their digital cataloguing and publication.

One of the attendees was Daniel Wojahn, a researcher in Tibetan Studies at the University of Oxford. We talked to Daniel to find out more about the workshop and how it has helped to tackle the challenge of digitising and cataloguing these important Tibetan texts.

The workshop was held on the beautiful Fraueninsel island on the Bavarian lake of Chiemsee. © Daniel Wojahn

 

Navigating a complex and challenging collection

The focus of the workshop was a newly discovered collection of 92 volumes of Tibetan texts from the Sakya tradition. “[The texts were in] every imaginable condition, [...] from clean block prints and ornate manuscripts to challenging cursives, hybrid hands, and even distorted xerox copies,” Daniel explained. This variety, while a rich source of information for manuscript specialists, poses a considerable problem for systematic, large-scale processing. “The goal was not to produce polished transcriptions, but to develop a practical workflow for identifying individual works, recording metadata, and preparing the material for future philological work.”

Adding to the complexity was the internal structure of the volumes. Rather than containing single, clearly defined texts, many were composite manuscripts bound together over time. “Many of the volumes appear to have been compiled in a rather ad hoc fashion, often containing multiple texts without clear visual divisions,” Daniel explained. This meant that a crucial first step was simply identifying where one text ended and another began—a task that can be extremely time-consuming when done manually.

To tackle this problem, the team used the Desk feature in Transkribus, which allows documents to be arranged visually into "Collections". “The overview allowed us to navigate these complex manuscripts more easily and identify where one text ends and another begins, giving us a strong starting point for our ongoing incipit and colophon work,” Daniel told us.

 

By sorting the documents into collections, the team had a visual overview of the different manuscripts. © Daniel Wojahn

 

Creating text recognition models for Tibetan

The field of Tibetan text recognition is still emerging, but the workshop was able to build on the foundational work of several other projects. While text recognition in Tibetan is still “largely defined by project-specific models,” these existing resources offered valuable reference points. The team could draw on models from initiatives like the PaganTibet project in Paris, which focuses on early ritual manuscripts; the Divergent Discourses project, working on modern newsprint; and the University of Oxford’s Law in Historic Tibet project. Even though these models were trained on different material, “they often managed to catch obscure abbreviations or contractions that also appeared in our manuscripts,” explained Daniel, demonstrating a useful degree of transferability between different types of Tibetan script.

To improve accuracy further, the group decided to train a baselines model for the collection. “Establishing accurate baselines is a crucial first step to ensure that the text recognition system focuses only on the actual writing and not on ‘dirt’ or other non-text elements,” Daniel explained. “[So] I trained a baselines model on a representative sample of our corpus (about 300 pages) before running a recognition model. Even with a relatively small initial dataset, the system produced usable results within hours.”

 

Some of the manuscripts contain illustrations and illuminations, making it necessary to perform layout recognition first. © Daniel Wojahn

 

Further challenges on the digital frontier

While the group found digital solutions for some of the challenges of Tibetan manuscripts, they also highlighted problems that persist for such collections. The first and most fundamental hurdle is the quality of the source images. Daniel noted that this “remains a decisive factor: blurred images, uneven lighting, corrupted or damaged paper, or low resolution can still prevent accurate text recognition,” demonstrating the importance of quality digitisation for successful text recognition.

Beyond the technical aspects of digitisation, the workshop also highlighted a broader, field-wide challenge: the siloed nature of text recognition model training. The project-specific models currently available for Tibetan, while effective for their intended purpose, have limited applicability across the vast diversity of Tibetan scripts and document types. For Daniel, there is one obvious solution: “The field would clearly benefit from closer collaboration and the creation of shared training datasets to develop models that can handle the wide variety of Tibetan hands, scripts, and formats found across collections.” The creation of such shared resources would represent a major step forward, enabling scholars to work more effectively with collections around the world.

 

Transkribus was able to create usable transcriptions even after only a few hours of work. © Daniel Wojahn

 

A foundation for future collaboration

The workshop successfully established a digital workflow for this complex Tibetan manuscript collection, increasing efficiency by replacing some manual methods with digital tools. “Using Transkribus sped up our cataloguing work noticeably, allowing participants to move from raw manuscript images to structured data much faster than before,” Daniel concluded.

In the process, participants gained hands-on experience in applying digital methods to manuscript work and began establishing a framework for the continued study of this collection and related materials. The catalogue data produced during the workshop will be made openly available through the Sakya Research Centre, ensuring long-term accessibility and integration with other Tibetan text initiatives. This collaborative approach offers a practical basis for improving access to Tibetan manuscripts and supporting sustained research in the field.

 

Want to use Transkribus for your research?

Scholars around the world use Transkribus to catalogue and transcribe historical documents, making them digitally accessible and searchable. Find out more about how Transkribus can help you with your research, or watch our Beginners' Webinar below to learn more about what Transkribus can do.

 

Headshot in cover image © Daniel Wojahn

 
