Publications

LwMDB: Open Metadata for Digitised Historical Newspapers from British Library Collections

Published in Journal of Open Humanities Data, 2025

The Living with Machines Database (LwMDB) represents one of the crown jewels of the Living with Machines project, showcasing the convergence of the project’s multiple newspaper-related datasets into a single, enriched resource for researchers interested in digitised British newspapers from the long nineteenth century. By combining British Library (BL) catalogue data, FindMyPast (FMP) metadata, and bespoke datasets such as WikiGaz and Mitchell’s Newspaper Press Directories, LwMDB enables an enriched view of digitised newspapers. Simplified CSV exports provide easy-to-use, reproducible access, while the database’s scalable architecture, built using Django and Docker, supports future iterations. In this way, LwMDB provides a foundation for humanities research made possible by powerful new metadata.

Le Journal of Open Humanities Data (JOHD) : enjeux et défis dans la publication de data papers pour les sciences humaines et sociales (SHS)

Published in Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux, 2025

In this chapter we present the data paper as a form of publication that aims to give due recognition to the work of preparing and processing data, from the perspective of the humanities and social sciences (HSS). After reconstructing as complete a definition of "data paper" as possible, we analyse how it has evolved over time. We try to identify the obstacles that may have contributed to the more limited uptake of data papers in HSS compared with the hard sciences to date, drawing attention to the extreme heterogeneity of research topics and data types in HSS. We then present the Journal of Open Humanities Data (JOHD), which is dedicated to publishing data-focused articles for HSS. We share our experience in this setting and also present an investigation into the impact of data papers on the reuse of the data they describe. To conclude, we propose a pyramid model in which the data paper helps to promote research results, as well as the work of data curation and analysis, in keeping with the values of open science and data sharing.

Natural Language Processing for Ancient Greek: Design, advantages and challenges of language models

Published in Diachronica, 2024

Computational methods have produced meaningful and usable results to study word semantics, including semantic change. These methods, belonging to the field of Natural Language Processing, have recently been applied to ancient languages; in particular, language modelling has been applied to Ancient Greek, the language on which we focus. In this contribution we explain how vector representations can be computed from word co-occurrences in a corpus and can be used to locate words in a semantic space, and what kind of semantic information can be extracted from language models. We compare three different kinds of language models that can be used to study Ancient Greek semantics: a count-based model, a word embedding model and a syntactic embedding model; and we show examples of how the quality of their representations can be assessed. We highlight the advantages and potential of these methods, especially for the study of semantic change, together with their limitations.
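As a rough illustration of the count-based idea mentioned in this abstract, the sketch below builds co-occurrence vectors from a tiny corpus and compares words by cosine similarity. It is a minimal sketch under stated assumptions, not the models described in the paper: the function names, the window size and the English toy corpus are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words that occur within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration (not Ancient Greek data).
toy_corpus = [
    ["the", "king", "rules", "the", "city"],
    ["the", "queen", "rules", "the", "city"],
    ["the", "ship", "sails", "the", "sea"],
]

vectors = cooccurrence_vectors(toy_corpus)
sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_ship = cosine(vectors["king"], vectors["ship"])
print(f"king~queen: {sim_king_queen:.2f}  king~ship: {sim_king_ship:.2f}")
```

Words that share contexts ("king" and "queen" both occur next to "rules") end up closer in the vector space than words that do not. Real count-based models typically reweight the raw counts (for example with PPMI) and reduce dimensionality, which this sketch omits.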

Mapping ‘when’-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology

Published in Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), 2024

Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination (‘when’-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the world’s many languages that exclusively use lexified connectors, and incorporating associations between character n-grams and English when. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
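One way to picture an association between character n-grams and English when is sketched below. It scores each n-gram in the target-language side of a sentence-aligned parallel corpus by pointwise mutual information (PMI) with the presence of "when" on the English side. This is only an assumed, simplified stand-in: the abstract does not specify the association measure, and the function names, the choice of PMI and the made-up target-language forms are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the set of character n-grams in `text`, padded with spaces."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_when_pmi(pairs, n=3):
    """PMI between an n-gram occurring in the target sentence and
    'when' occurring in the aligned English sentence.

    `pairs` is a list of (english_sentence, target_sentence) tuples.
    Only n-grams seen at least once alongside 'when' receive a score.
    """
    total = len(pairs)
    when_total = 0
    ngram_count = Counter()
    joint_count = Counter()
    for english, target in pairs:
        has_when = "when" in english.lower().split()
        when_total += has_when
        for gram in char_ngrams(target, n):
            ngram_count[gram] += 1
            if has_when:
                joint_count[gram] += 1
    scores = {}
    for gram, joint in joint_count.items():
        p_joint = joint / total
        p_gram = ngram_count[gram] / total
        p_when = when_total / total
        scores[gram] = math.log2(p_joint / (p_gram * p_when))
    return scores


# Invented forms: the suffix "-ki" on the first word stands in for a
# morphological 'when' marker. These are not data from any real language.
toy_pairs = [
    ("when he came", "sonaki melu"),
    ("when she saw", "tiraki venu"),
    ("he came", "sona melu"),
    ("she saw", "tira venu"),
]

scores = ngram_when_pmi(toy_pairs)
print(scores["aki"], scores["mel"])
```

In the toy data the n-gram "aki", which spans the invented suffix, is positively associated with "when", while an n-gram from the unrelated word "melu" is not. This is the sense in which subtoken units can pick up morphological marking that whole-token alignment would miss.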

Language of Mechanisation Crowdsourcing Datasets from the Living with Machines Project

Published in Journal of Open Humanities Data, 2024

We present the ‘Language of Mechanisation’ datasets with examples of re-use in visualisations and analysis. These reusable CSV files, published on the British Library’s Research Repository, contain automatically-transcribed text from 19th-century British newspaper articles. Volunteers on the Zooniverse crowdsourcing platform took part in tasks that asked ‘How did the word x change over time and place?’ They annotated articles with pre-selected meanings (senses) for the words coach, car, trolley and bike. The datasets can support scholarship on a range of historical and linguistic research areas, including research on crowdsourcing and online volunteering behaviours, data processing and data visualisation methodologies.

The semantic map of when and its typological parallels

Published in Frontiers in Communication, 2023

In this paper, we explore the semantic map of the English temporal connective when and its parallels in more than 1,000 languages drawn from a parallel corpus of New Testament translations. We show that there is robust evidence for a cross-linguistic distinction between universal and existential WHEN. We also see tentative evidence that innovation in this area involves recruiting new items for universal WHEN, which can gradually take over the existential usage. Another possible distinction that we see is between serialized events, which tend to be expressed with non-lexified constructions, and framing/backgrounding constructions, which favor an explicit subordinator.

Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work

Published in Proceedings of the Ancient Language Processing Workshop associated with RANLP-2023, 2023

We evaluate four count-based and predictive distributional semantic models of Ancient Greek against AGREE, a composite benchmark of human judgements, to assess their ability to retrieve semantic relatedness. On the basis of the observations deriving from the analysis of the results, we design a procedure for a larger-scale intrinsic evaluation of count-based and predictive language models, including syntactic embeddings. We also propose possible ways of exploiting the different layers of the whole AGREE benchmark (including both human- and machine-generated data) and different evaluation metrics.

Machines in the media: semantic change in the lexical field of mechanization in 19th-century British newspapers

Published in Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (NLP4DH), 2022

The industrialization process associated with the so-called Industrial Revolution in 19th-century Great Britain was a time of profound changes, including in the English lexicon. An important yet understudied phenomenon is the semantic shift in the lexicon of mechanisation. In this paper we present the first large-scale analysis of terms related to mechanization over the course of the 19th century in English. We draw on a corpus of historical British newspapers comprising 4.6 billion tokens and train historical word embedding models. We test existing semantic change detection techniques and analyse the results in light of previous historical linguistic scholarship.

Deep Impact: A Study on the Impact of Data Papers and Datasets in the Humanities and Social Sciences

Published in Publications, 2022

The humanities and social sciences (HSS) have recently witnessed an exponential growth in data-driven research. In response, attention has been afforded to datasets and accompanying data papers as outputs of the research and dissemination ecosystem. In 2015, two data journals dedicated to HSS disciplines appeared in this landscape: Journal of Open Humanities Data (JOHD) and Research Data Journal for the Humanities and Social Sciences (RDJ). In this paper, we analyse the state of the art in the landscape of data journals in HSS using JOHD and RDJ as exemplars by measuring performance and the deep impact of data-driven projects, including metrics (citation count, Altmetrics, views, downloads, tweets) of data papers in relation to associated research papers and the reuse of associated datasets. Our findings indicate that data papers are published following the deposit of datasets in a repository and usually following research articles; that data papers have a positive impact on both the metrics of research papers associated with them and on data reuse; and that Twitter hashtags targeted at specific research campaigns can lead to increases in data papers’ views and downloads. HSS data papers improve the visibility of datasets they describe, support accompanying research articles, and add to transparency and the open research agenda.

One question, different annotation depths: A case study in Early Slavic

Published in Journal of Historical Syntax, 2022

This paper addresses some of the challenges of carrying out corpus-based linguistic analyses on historical corpora of different sizes and annotation depths. Data from the TOROT Treebank is collected to carry out a case study on Early Slavic dative absolutes, showing the extent to which methodology and results may change depending on the amount of data and the levels of linguistic annotation available. The analysis indicates that deeply-annotated treebanks of limited size can be exploited to establish a solid guideline to analyze a phenomenon in shallowly-annotated corpora and even new, unannotated texts. This is particularly encouraging for historical languages, such as Early Slavic, showing very high diatopic and diachronic variation, which significantly undermines corpus-annotation automation and therefore calls for alternative strategies to counteract data scarcity.

OldSlavNet: A scalable Early Slavic dependency parser trained on modern language data

Published in Software Impacts, 2021

Historical languages are increasingly being modelled computationally. Syntactically annotated texts are often a sine qua non in their modelling, but parsing of pre-modern language varieties faces great data sparsity, intensified by high levels of orthographic variation. In this paper we present a good-quality Early Slavic dependency parser, attained via manipulation of modern Slavic data to resemble the orthography and morphosyntax of pre-modern varieties. The tool can be deployed to expand historical treebanks, which are crucial for data collection and quantification, and beneficial to downstream NLP tasks and historical text mining.

Exploiting cross-dialectal gold syntax for low-resource historical languages: Towards a generic parser for pre-modern Slavic

Published in CEUR Workshop Proceedings, 2020

This paper explores the possibility of improving the performance of specialized parsers for pre-modern Slavic by training them on data from different related varieties. Because of their linguistic heterogeneity, pre-modern Slavic varieties are treated as low-resource historical languages, whereby cross-dialectal treebank data may be exploited to overcome data scarcity and attempt the training of a variety-agnostic parser. Previous experiments on early Slavic dependency parsing are discussed, particularly with regard to their ability to tackle different orthographic, regional and stylistic features. A generic pre-modern Slavic parser and two specialized parsers – one for East Slavic and one for South Slavic – are trained using jPTDP [8], a neural network model for joint part-of-speech (POS) tagging and dependency parsing which had shown promising results on a number of Universal Dependency (UD) treebanks, including Old Church Slavonic (OCS). With these experiments, a new state of the art is obtained for both OCS (83.79% unlabelled attachment score (UAS) and 78.43% labelled attachment score (LAS)) and Old East Slavic (OES) (85.7% UAS and 80.16% LAS).
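For readers unfamiliar with the two metrics reported in this abstract, the sketch below shows how UAS and LAS are conventionally computed: UAS is the percentage of tokens assigned the correct head, and LAS the percentage assigned both the correct head and the correct dependency label. It is a minimal sketch, not the evaluation script used in the paper; the function name and the four-token example are invented, and real evaluation tools may differ in details such as how punctuation is treated.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    `gold` and `predicted` are equal-length lists with one
    (head_index, relation_label) tuple per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: only the head has to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and label both have to match.
    correct_labelled = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100 * correct_heads / n, 100 * correct_labelled / n


# Invented four-token example (head 0 marks the root).
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "obl")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")
```

In the example, three of four heads are correct (UAS 75%), but only two tokens have both head and label correct (LAS 50%), which is why LAS can never exceed UAS.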