
Unlocking the secrets of the New Spain Fleets with Patricia Murrieta-Flores

Fiona Park

Historians of colonial Latin America don’t suffer from a lack of primary sources. Across archives in Europe and the Americas lie millions of pages documenting the colonial maritime routes known as the New Spain Fleets. Ship manifests, passenger lists, cargo inventories, and correspondence offer extraordinary insight into global trade, migration, science, and colonial power. The problem is not quantity, but accessibility. The dense, highly variable sixteenth-century handwriting has made it almost impossible for researchers to transcribe and analyse these valuable sources at scale, meaning much of the information they contain remains locked away.

Professor Patricia Murrieta-Flores from Lancaster University and Tecnológico de Monterrey is working to break that impasse. Her project goes far beyond converting images into text. By combining AI-driven transcription with innovative digital workflows, her team is transforming vast archival collections into searchable, analysable data, creating new ways to study the New Spain Fleets and those whose lives were shaped by them.

In this blog post, Patricia shares how her team is using advanced digital methods to unlock these records on a mass scale, what this means for historical research, and why the project matters far beyond academia.



The importance of the New Spain Fleets

The New Spain Fleets were the lifeline of the Spanish colonial empire for over three centuries. These regular voyages between Spain and the colonised Americas facilitated a kind of exchange between people that the world had never seen before. As Patricia explained: “These fleets transported not only silver and precious resources but also Indigenous plants that would transform global medicine and economies, scientific knowledge that reshaped European understanding of the natural world, [as well as] thousands of individuals whose stories reveal the complex social fabric of the colonial period.”

Luckily for historians, much of the documentation written at the time about the New Spain Fleets has survived. The National Archives of Mexico alone holds more than 52 km of sources from the era, such as ship logs and passenger records. A collection this large is full of vital information about trade networks, social mobility, scientific exchange, and the experiences of enslaved Indigenous peoples. But analysing it manually would mean processing millions of pages by hand, a task too time-consuming for any research team to tackle.

 The Archivo General de Indias in Spain was another archive that supplied documents for the New Spain Fleets project. Published under Creative Commons Attribution 3.0 Unported 

 

Creating a digital workflow

As a digital humanities specialist, Patricia knew that digital methods could provide a compelling alternative. By creating digital versions of the documents, Patricia and her team can then easily search the entire collection for certain names or words, and link together relevant information from many different sources. But transcribing these kinds of documents had thus far been a hurdle for researchers due to the challenging handwriting styles, abbreviations, Indigenous terms, and often poor-quality scans.

“Only a small fraction [of the documents] has been transcribed due to the specialised palaeographic skills required,” Patricia explained. “With archives holding hundreds of millions of pages, automated transcription is the only viable path to unlocking this vast repository of human history.”

After careful consideration, the team chose to process the documents with Transkribus. This was partly because of its proven track record of producing transcriptions with a character error rate (CER) of less than 10%, but also because it allows humanities scholars to train AI models without being computer scientists. This was essential since most publicly available Spanish models focus on European sources. "The platform's flexibility enabled us to create models specifically tailored to American Spanish colonial documents, which exhibit unique characteristics including hybrid calligraphies, Indigenous language integration, and different writing conventions," Patricia explained.
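For readers new to the metric, CER is conventionally the number of character insertions, deletions, and substitutions needed to turn a model's output into the correct transcription, divided by the length of the correct transcription. The Python sketch below illustrates that standard definition. It is not Transkribus's own implementation, and the function names and sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: an expert transcription versus a model's output.
ground_truth = "en la ciudad de mexico"
model_output = "en la cuidad de mexico"
print(f"CER: {cer(ground_truth, model_output):.1%}")
```

In this made-up example, two wrong characters in a 22-character line give a CER of roughly 9%, just under the 10% mark.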



Training models capable of reading Indigenous Nahuatl was key to understanding the documents in the collection. © Transkribus

 

Training custom models

Creating the Ground Truth training data for these models was a major undertaking. "We built large, high-quality Ground Truth datasets, with expert palaeographers performing manual transcriptions that were then reviewed by a second expert," says Patricia. They also manually adjusted line polygons to ensure that characters extending above or below a line were properly captured during the training process.

The team could have trained one large generic model for the collection. But as the collection included a mix of calligraphic styles, such as Cortesana, Itálica Cursiva, Redonda, Procesal Simple, and Procesal Encadenada, they chose a different strategy. “We adopted a specialised approach: Training separate models for each specific calligraphic style rather than creating one generalised model, which yielded significantly better results,” Patricia explained. The team also continually retrained their models, using the previous iteration as a base model for the new version. This enabled them to improve the accuracy of their models in a time-efficient way.

 

  The team used Named Entity Recognition and Disambiguation to identify and tag entities such as people, places, and Indigenous language terms. © Patricia Murrieta-Flores

 

Analysing the transcriptions

As in many research projects of this type, transcription was just the start of the digital process. “[We] developed a comprehensive automated annotation system using Named Entity Recognition and Disambiguation (NERD),” Patricia explained. “Using machine learning, we trained models to automatically identify and classify words [...], adapting neural networks with word-embedding algorithms to handle the pre-modern Castilian syntax and the frequent presence of Indigenous language terms.”

Following on from this, the team also developed an annotation tool called HistoTag, which enables humanities scholars to create their own ontologies and controlled vocabularies, and to automate the annotation of historical corpora with machine learning for large-scale analysis. They then combined GIS and corpus linguistics to create Geographical Text Analysis (GTA) software. “[This enables] keyword-in-context searches, geographic collocation analysis, and the cross-referencing of textual information with geographic locations,” Patricia said.
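To give a feel for two of the techniques Patricia mentions, here is a minimal Python sketch of a keyword-in-context search and a simple count of the words that appear near a place name. It is an illustration only and not the team's GTA software: the function names and the sample sentence are invented, place names are matched as single words, and nothing here links places to map coordinates.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (accents are kept)."""
    return re.findall(r"\w+", text.lower())


def kwic(text: str, keyword: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Keyword in context: each hit with `window` words on either side."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def place_collocates(text: str, place: str, window: int = 3) -> Counter:
    """Count the words that occur within `window` tokens of a place name."""
    tokens = tokenize(text)
    target = place.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sample text, not taken from the project's corpus.
SAMPLE = (
    "The fleet left Veracruz with silver. "
    "Cochineal from Oaxaca reached Veracruz before the fleet sailed."
)

for left, word, right in kwic(SAMPLE, "Veracruz", window=2):
    print(f"{left:>20} [{word}] {right}")

print(place_collocates(SAMPLE, "Veracruz", window=3).most_common(3))
```

A real pipeline would work on place names already tagged by the NERD step and would resolve them to coordinates, but the core idea of reading a term alongside its immediate neighbours is the same.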

Using these add-on digital tools has enabled the team to analyse the historical documents on a scale that would previously have been impossible. “This approach allows us to identify patterns across thousands of documents in seconds, facilitating both distant reading of large-scale trends and close reading of specific contexts, making it possible to ask and answer research questions that would be impossible to address manually.”

 

The Geographical Text Analysis software allowed the team to cross-reference geographic entities with textual information. © Patricia Murrieta-Flores

 

Applying the research

With the digital work complete, the team could move on to the historical analysis. They are using the results to study different aspects of the New Spain Fleets era: social networks, trade routes, scientific knowledge and Indigenous plants, outward marks (non-textual symbols in historical documents), and the Indigenous slave trade. By focusing on these five areas, Patricia and her team can answer longstanding historical questions about the Spanish colonial period, such as how Indigenous medical knowledge travelled, how colonial trade routes differed from late Mesoamerican ones, and how Indigenous enslavement persisted despite being outlawed.

“We selected these five case studies because they address critical gaps in our understanding of colonial history,” Patricia explained. “Together, these studies demonstrate how AI can reveal histories that have remained invisible due to the scale of documentation required.”

You can read more about the team’s case studies on their website.

 

The New Spain Fleets website has further information about the goals and progress of the project, as well as information about the researchers involved. © Transkribus

 

Advice for other researchers

A project of this size is not without its challenges. Along the way, Patricia and her team have overcome many hurdles, and have several pieces of advice for other teams contemplating a digital humanities project on this scale.

  • Invest in Ground Truth: "Invest heavily in high-quality Ground Truth data, it is the foundation of everything."
  • Use specialised models: "Consider training specialised models for distinct document types or calligraphic styles rather than pursuing one generalised model."
  • Prioritise image quality: "Address image quality from the start; advocate for high-resolution digitisation as it dramatically impacts HTR performance."
  • Iterate and transfer: "Embrace an iterative approach, use successfully trained models as base models for subsequent iterations through transfer learning."
  • Collaborate across disciplines: "Build interdisciplinary teams combining palaeographers, historians, computer scientists, and when possible, Indigenous language experts."
  • Share your work: "Make your models publicly available to support collaborative research and benefit underserved communities and institutions."

 

The social impact of this groundbreaking project

The New Spain Fleets project represents a breakthrough in the digital accessibility of historical documents. It uses Transkribus and a range of other digital tools to create a digital corpus of millions of previously inaccessible documents, opening the door for other scholars to research this era on a scale that would previously have been impossible.

But for Patricia, the social impact of this research is equally important. Models capable of transcribing documents in Indigenous languages like Nahuatl make it easier for communities to access historical information relevant to their lands, languages, and histories, helping to counter traditional narratives and decolonise history in Latin America.

“This project [demonstrates] that AI can unlock centuries of historical information not just for academic researchers, but for archives, libraries, Indigenous communities, and the general public throughout Latin America and beyond. For me, this is not just about technology, it's about justice, accessibility, and ensuring that marginalised voices and histories can finally be heard at scale.”

 

Thank you, Patricia, for giving us such a great insight into your work!

Cover image courtesy of Tecnológico de Monterrey
