Open datasets and software

[HuggingFace] BERTislav: a BERT-based fill-mask Early Slavic language model
[HuggingFace] OldSlavNet: model and data
[HuggingFace] Binary RoBERTa-based classifier fine-tuned on historical British newspaper articles reporting on suicide, to distinguish between (confirmed or speculated) suicide cases, investigations, or court cases
[Figshare] Replication data and code for: Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology
[Zenodo] Early Slavic language models
[Zenodo] Ancient Greek language models
[Figshare] Replication data for: A quantitative and typological study of Early Slavic participle clauses and their competition (University of Oxford, DPhil Thesis)
[Figshare] Code and data for 'The Semantic Map of When and its Typological Parallels'
[GitHub] DiachronicEmb-BigHistData: Tools to train and explore diachronic word embeddings from Big Historical Data
[GitHub] OldSlavNet (Early Slavic dependency parser)
[GitHub] Introduction to Text Mining: Jupyter notebooks (used as teaching material for the MSc in Digital Scholarship at the University of Oxford, October 2023)
[Zenodo] Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919)
[Zenodo] Diachronic and diatopic word embeddings from newspapers digitised by the British Library (1830-1889): North and South England
[GitHub] Python scripts to train and explore syntactic (graph-based, Node2Vec) word embeddings for Ancient Greek
[Zenodo] DataPapersAnalysis: Scripts to carry out impact analysis on the publication metrics related to the Journal of Open Humanities Data and the Research Data Journal for the Humanities and Social Sciences
[Figshare] Data from 'One question, different annotation depths: A case study in Early Slavic'
[Figshare] Data from 'Exploiting cross-dialectal gold syntax for low-resource historical languages: towards a generic parser for pre-modern Slavic'
[GitHub] Word-alignment models for Bible translations in 100+ historical and contemporary languages and scripts to train them
[GitHub] Mixed drafts, scripts or data useful for NLP tasks on Pre-Modern Slavic