+ Unleashing the Transkribus API
by David Brown and Stephen Crane, Trinity College Dublin
On 30 June 1922, at the outset of the Irish Civil War, a cataclysmic explosion and fire destroyed the Public Record Office of Ireland at the Four Courts, Dublin. Flames and heat consumed seven centuries of Ireland’s recorded history, stored in a magnificent six-storey Victorian repository known as the Record Treasury. On the centenary of the 1922 blaze, the Beyond 2022 project at Trinity College Dublin will unveil Ireland’s Virtual Record Treasury—a digital reconstruction of the Public Record Office of Ireland building and its collections.
Large parts of these collections were copied prior to the fire: the work of antiquarians, historians and publicly funded projects that intended to publish the most historically significant parts of the collection as printed source material for scholars. For various reasons, only a small proportion of the output of these huge transcription projects was ever published, but copies survive in manuscript, running to millions of pages of handwritten text. The transcriptions were made between the seventeenth and nineteenth centuries in the trained secretarial hand of the times. Most projects were entrusted to a single transcriber, usually an expert in a particular field, and some individuals transcribed up to 25,000 pages over a period of many years. With so many examples of very large quantities of text produced by a single hand, the Irish Record Office transcriptions might as well have been prepared with Transkribus in mind.
The collections reflect the cataloguing arrangements in the original record office, and the largest sets of copies deal with topics central to the study of Irish history: the Elizabethan conquest and administration, the Plantation of Ulster, the Cromwellian occupation of Ireland, the Williamite wars, and the breaking up of the great landed estates in the nineteenth century. The transcripts nevertheless cover all areas of history, and the material includes early census-type records, trade records, legal judgements and a wide range of smaller thematic collections related to specific towns and cities. The digitisation is most advanced for the Cromwellian period, 1650-1659, and the volume of documents recovered surpasses what has survived for most parts of England.
Transkribus works very well on large, relatively uniform collections such as these. Several HTR models have been prepared, each trained on 15,000 words, beginning with the nineteenth-century hands and achieving, in some cases, a Character Error Rate (CER) of less than 2%. As the number of trained models increased, a separate project emerged to investigate whether the existing models could be used to partially recognise a sample from the next set of documents, and so speed up the process of creating each subsequent set of ground truth. It was decided to create a single-page ground truth for each new example, and to compare this with the text automatically generated by each model in the project in order to find the best one to work with.
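To make the comparison concrete, the sketch below shows one common way of scoring a model's output against a ground-truth page: the Character Error Rate taken as the Levenshtein edit distance divided by the length of the reference text. This is an illustration only. The function names and the sample strings are hypothetical, and Transkribus may normalise text or compute its own CER differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # previous[j] is the distance between the reference prefix handled so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # delete a reference character
                               current[j - 1] + 1,       # insert a hypothesis character
                               previous[j - 1] + cost))  # substitute, or match at no cost
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER as edit distance divided by reference length (an assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def best_model(ground_truth, outputs):
    """Return (model name, CER) for the output closest to the ground truth.

    `outputs` maps a model name to the text that model produced for the page.
    """
    scores = {name: character_error_rate(ground_truth, text)
              for name, text in outputs.items()}
    name = min(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the project.
    page = "the record office"
    candidates = {"model_a": "the recrd office", "model_b": "teh recrod ofice"}
    print(best_model(page, candidates))
```

With one ground-truth page and the text produced by each candidate model, `best_model` returns the name of the model with the lowest error rate, which is the selection step described above.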
Transkribus comprises a cross-platform client GUI which is downloaded and run on the user's local machine, whether Windows, Mac or Linux. This GUI communicates with a remote server over the Web. The server allows users to manage collections of documents, train HTR models and run models against document collections, all in response to requests made through the GUI.
Unusually, the Transkribus project has separately published the open-source client library which the GUI uses to make requests to the server. As part of a summer project, we decided to use this library as the basis for a scripting language, allowing us to write mini-programs (scripts) that automate common tasks outside the GUI while using the same back-end services.
The client library as shipped is written in the Java programming language, which runs on a virtual machine known as the JVM; this is what enables the client to be cross-platform. We decided to base our scripting language on Clojure, a modern Lisp which also runs on the JVM and provides excellent Java interoperability.
Our scripting language, which we call Transkript, is also published as open source on GitHub. It does not implement all of the underlying API, just enough to enable a couple of small scripting applications: eval-models and run-ocr.
The first script, eval-models, compares multiple trained models associated with a collection, using the first page of a specified document. Using the GUI, this would be a laborious affair, since running each model takes some time. With the script, a user can start the comparison and return later to browse the results.
The second script is used to upload a folder of images representing pages of a typewritten document, run OCR on it, and download the text output of the OCR process.
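The shape of such a batch job can be sketched in a few lines. The example below is not the Transkript script itself and does not call the Transkribus client library. The step that uploads the pages, runs OCR on the server and downloads the text is represented by a hypothetical callable, `recognise_document`, supplied by the caller. The function names and the list of image suffixes are likewise assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of page-image file types; adjust to match the scans in hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def page_images(folder):
    """Return the image files in `folder`, sorted by file name.

    Sorting is plain lexicographic, so page numbers in file names should be
    zero-padded (page_02 before page_10) for pages to stay in order.
    """
    return sorted(path for path in Path(folder).iterdir()
                  if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES)


def run_ocr_batch(folder, recognise_document, output_file):
    """Pass the ordered pages to `recognise_document` and save the returned text.

    `recognise_document` is a hypothetical stand-in for the remote steps
    (upload, OCR, download): it takes a list of paths and returns a string.
    Returns the number of pages processed.
    """
    pages = page_images(folder)
    text = recognise_document(pages)
    Path(output_file).write_text(text, encoding="utf-8")
    return len(pages)
```

Keeping the remote step behind a single callable means the local bookkeeping (finding pages, ordering them, saving the text) can be tried out with a dummy function before any job is sent to the server.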
The power of our approach is that each of these scripts took only a couple of hours to write and test, and the core of each is about a dozen lines of fluent code, which is quite comprehensible even to relatively non-technical users. The scripting language does not add any new functionality to Transkribus, but it enables dramatically increased productivity through the batch processing of large numbers of jobs. Multiple additional scripts can be employed, for example to run HTR on documents automatically once the most appropriate model has been identified by the eval-models script.
- For more information on the Beyond 2022 project contact David Brown, brownd4@tcd.ie
- For more information on Transkript contact Stephen Crane, jscrane@gmail.com
- Transkript can be found at: https://github.com/jscrane/transkript