
Training set: language varieties

Below we give details of each text included in the training data for OldSlavNet, so that users who wish to replicate or improve our results have a transparent breakdown of the dataset.

If you are not satisfied with OldSlavNet’s parsing performance on your particular texts, you are very welcome to send us suggestions on how we might improve it, particularly on the basis of expert evaluation of the data themselves (e.g. distribution and representativeness of different pre-modern Slavic varieties and Church Slavonic recensions). Alternatively, you may want to train your own model!

The following are all the texts used as training data, with their language variety and number of tokens. The proposed language classification is by no means definitive, since several texts show mixed regional features and survive in manuscripts much later than the varieties they represent as a whole.

Variety Text                                    Tokens
OCS     Codex Marianus                          58,269
        Codex Suprasliensis                     79,070
        Codex Zographensis                       1,098
        Kiev Missal                                370
        Psalterium Sinaiticum                      248
SCS     Vita Constantini                           890
RCS     Vita Methodii                              331
OES     Primary Chronicle (Codex Laurentianus)  56,725
        Suzdal Chronicle (Codex Laurentianus)   23,760
        Primary Chronicle (Codex Hypathianus)    3,610
        First Novgorod Chronicle (Synodal)      17,838
        Kiev Chronicle (Codex Hypathianus)         544
        Colophon (Mstislav’s Gospel)               259
        Colophon (Ostromir Codex)                  199
        Missive (Archbishop of Riga)               171
        Mstislav’s letter                          158
        Novgorod’s treaty with Jaroslav            423
        Russkaja pravda                          4,174
        Statute of Prince Vladimir                 495
        Treaty (Smolensk-Riga-Gotland)           1,421
        The Tale of Igor’s Campaign              2,850
        Russkaja pravda                          4,174
        Uspenskij Sbornik (excerpts)            25,189
        Varlaam Xutynskij’s Grant Charter          148
MRus    Afanasij Nikitin’s Journey               6,842
        Charter of Prince Jurij Svjatoslavich      344
        Correspondence of Peter the Great          100
        Domostroj                               23,459
        Life of Sergij of Radonezh              20,361
        History of the schism (materials)        1,835
        Missive (Ivan of Pskov)                    339
        Testament (Ivan Jur’evič Graznoj)          421
        Life of Avvakum                         22,835
        Tale of Dracula                          2,487
        The tale of Luka Koločskij                 906
        The taking of Pskov                      2,326
        The tale of the fall of Constantinople   9,258
        Vesti-Kuranty                            1,154
        Zadonščina                               2,399
ONov    Birchbark letters                        1,965
        Novgorod service book marginalia            93
        Novgorodians' losses                       187
ModRus  SynTagRus                               18,355
BCS     UD Serbian-SET                          17,622
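
For users who want to judge how the varieties are balanced, the following is a minimal Python sketch of how token counts from the table could be tallied per variety. It is an illustration only and not part of OldSlavNet: the function name tokens_per_variety and the ROWS list are our own hypothetical names, and ROWS contains just a hand-copied subset of the rows above (the OCS, SCS, RCS and ONov entries). As in the table, an empty variety field is assumed to mean "same variety as the previous row".

```python
from collections import defaultdict

# Hand-copied subset of the table (variety, text, tokens).
# An empty variety string means "same variety as the previous row".
ROWS = [
    ("OCS", "Codex Marianus", "58,269"),
    ("", "Codex Suprasliensis", "79,070"),
    ("", "Codex Zographensis", "1,098"),
    ("", "Kiev Missal", "370"),
    ("", "Psalterium Sinaiticum", "248"),
    ("SCS", "Vita Constantini", "890"),
    ("RCS", "Vita Methodii", "331"),
    ("ONov", "Birchbark letters", "1,965"),
    ("", "Novgorod service book marginalia", "93"),
    ("", "Novgorodians' losses", "187"),
]


def tokens_per_variety(rows):
    """Sum token counts per variety, carrying the variety label forward."""
    totals = defaultdict(int)
    current = None
    for variety, _text, tokens in rows:
        if variety:
            current = variety
        if current is None:
            raise ValueError("the first row must name a variety")
        # Token counts use a comma as thousands separator, e.g. "58,269".
        totals[current] += int(tokens.replace(",", ""))
    return dict(totals)


totals = tokens_per_variety(ROWS)
grand_total = sum(totals.values())

# Print each variety with its token count and its share of the subset.
for variety, count in totals.items():
    print(f"{variety:<8}{count:>8,}{count / grand_total:>8.1%}")
```

Extending ROWS with the remaining rows of the table gives the breakdown for the full training set; the shares printed here refer only to the subset included in the sketch.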