
What is a text? Starting to understand the theory behind Automated Text Recognition

READ-COOP

What is a text? A simple question with a not-so-simple answer. Coming from the scholarly editing tradition, Patrick Sahle, Professor at Albertus Magnus University of Cologne, has demonstrated in detail how much the understanding of text can differ: from a string of signs on paper to a work by a literate individual that has to be (re)constructed from several versions and prints.

To analyze the different aspects of a text systematically, Sahle started drawing the so-called ‘text-wheel’ (there’s a chapter about this in his third volume on scholarly digital editions, p. 45-55; see also Sahle, Patrick: What is a Scholarly Digital Edition?, in: Matthew James Driscoll and Elena Pierazzo (eds.), Digital Scholarly Editing: Theories and Practices. Cambridge, UK: Open Book Publishers, 2016, OBP.0095, p. 20-39).

The result is a range of entities that a text can be understood as; some of these meanings oppose each other, while others hardly differ.

To start understanding Automated Text Recognition from a theoretical standpoint, we began discussing with Professor Sahle how text is recognized in Transkribus, and what form of ‘text’ that is (the same question applies in general if you’re using recognition tools such as OCR engines). The result is our own ‘text-wheel’, drawn by Julia Sorouri.

Most importantly, text in Transkribus is understood as signs on a surface; you will need facsimiles, or rather digitized images of documents, in order to perform Automated Text Recognition. Through interpretation via machine learning (or typing by a human), it’s possible to produce text as it exists as a document (separated into text regions and line regions, and possibly word regions too in the future). From this point you can go on to extract text as a linguistic entity or as a work (for example, by using Document Understanding technology to identify titles or marginalia), or even build upon entities in the text, understanding text as a carrier of information.
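To make these layers a little more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the class and function names (Page, TextRegion, LineRegion, linguistic_text, titles) are hypothetical and are not the Transkribus data model or API, and the sample page is invented.

```python
from dataclasses import dataclass, field


@dataclass
class LineRegion:
    """One line of writing located on the image (hypothetical structure)."""
    bbox: tuple[int, int, int, int]  # x, y, width, height in pixels
    transcription: str = ""          # filled in by recognition or by a human


@dataclass
class TextRegion:
    """A block of lines, e.g. a paragraph, a title, or a marginal note."""
    kind: str                        # assumed labels: "paragraph", "title", "marginalia"
    lines: list[LineRegion] = field(default_factory=list)


@dataclass
class Page:
    """Text as signs on a surface: an image plus the regions found on it."""
    image_path: str
    regions: list[TextRegion] = field(default_factory=list)


def linguistic_text(page: Page) -> str:
    """Text as a linguistic entity: the transcription without its layout."""
    return "\n".join(
        line.transcription for region in page.regions for line in region.lines
    )


def titles(page: Page) -> list[str]:
    """Text as a work: pick out structural parts, here the title regions."""
    return [
        " ".join(line.transcription for line in region.lines)
        for region in page.regions
        if region.kind == "title"
    ]


# Invented sample page, for illustration only.
page = Page(
    image_path="page_001.jpg",
    regions=[
        TextRegion("title", [LineRegion((10, 10, 400, 40), "A title")]),
        TextRegion("paragraph", [
            LineRegion((10, 60, 400, 30), "first line"),
            LineRegion((10, 95, 400, 30), "second line"),
        ]),
    ],
)

print(linguistic_text(page))
print(titles(page))
```

The point of the sketch is that the same page object can be read in several ways: as an image with regions (the document), as a plain string (the linguistic entity), or as a set of structural parts (the work).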

The wheel shows which aspects of a text can be identified and the direction we are aiming at with the READ project. We want to provide high-quality Automated Text Recognition, but we are also thinking about how to ensure the validity and plausibility of text.

Let’s start a discussion that goes beyond the quality of text recognition and aims at a theory of Automated Text Recognition.

——–

By Dr Tobias Hodel, University of Zurich and State Archives of Zurich.
