Data preprocessing

Input format

OldSlavNet takes its input in the 10-column CoNLL-U format used by Universal Dependencies. A file that is ready to be tagged, with only the token index and word form columns filled, may look like the following:

1	да	_	_	_	_	_	_	_	_
2	тѹтъ	_	_	_	_	_	_	_	_
3	есми	_	_	_	_	_	_	_	_
4	жил	_	_	_	_	_	_	_	_
5	в	_	_	_	_	_	_	_	_
6	чебокарѣ	_	_	_	_	_	_	_	_
7	ѕ҃	_	_	_	_	_	_	_	_
8	мсць	_	_	_	_	_	_	_	_

1	да	_	_	_	_	_	_	_	_
2	в	_	_	_	_	_	_	_	_
3	сарѣ	_	_	_	_	_	_	_	_
4	жил	_	_	_	_	_	_	_	_
5	мсць	_	_	_	_	_	_	_	_
6	в	_	_	_	_	_	_	_	_
7	маздраньскои	_	_	_	_	_	_	_	_
8	земли	_	_	_	_	_	_	_	_

OldSlavNet only adds annotation to columns 4 (part of speech), 7 (index of the parent node) and 8 (dependency relation). The other columns may therefore already be filled, e.g. if you have already added lemmatization and morphological analysis, and they will not be overwritten. Comment lines preceding a sentence are ignored. A partially filled, ready-to-be-annotated input file may then also look like this:

# source = Afanasij Nikitin’s journey beyond three seas, 7
# text = во индѣискои земли гости сѧ ставѧть по подворьемь
# sent_id = 195775
1	во	въ	_	_	_	_	_	_	_
2	индѣискои	индѣискыи	_	_	Case=Loc|Degree=Pos|Gender=Fem|Number=Sing|Strength=Weak	_	_	_	_
3	земли	земля	_	_	Case=Loc|Gender=Fem|Number=Sing	_	_	_	_
4	гости	гость	_	_	Case=Nom|Gender=Masc|Number=Plur	_	_	_	_
5	сѧ	себе	_	_	Case=Acc|Number=Sing|Person=3	_	_	_	_
6	ставѧть	ставити	_	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	_	_	_	_
7	по	по	_	_	_	_	_	_	_
8	подворьемь	подворие	_	_	Case=Dat|Gender=Neut|Number=Plur	_	_	_	_

# source = Afanasij Nikitin’s journey beyond three seas, 7
# text = а ѣсти варѧть на гости господарыни и постелю стелють и спѧть с гостьми
# sent_id = 195118
1	а	а	_	_	_	_	_	_	_
2	ѣсти	ѣсти	_	_	Tense=Pres|VerbForm=Inf|Voice=Act	_	_	_	_
3	варѧть	варити	_	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	_	_	_	_
4	на	на	_	_	_	_	_	_	_
5	гости	гость	_	_	Case=Acc|Gender=Masc|Number=Plur	_	_	_	_
6	господарыни	господарыни	_	_	Case=Nom|Gender=Fem|Number=Plur	_	_	_	_
7	и	и	_	_	_	_	_	_	_
8	постелю	постеля	_	_	Case=Acc|Gender=Fem|Number=Sing	_	_	_	_
9	стелють	стьлати	_	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	3	_	_	_
10	и	и	_	_	_	_	_	_	_
11	спѧть	съпати	_	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	_	_	_	_
12	с	съ	_	_	_	_	_	_	_
13	гостьми	гость	_	_	Case=Ins|Gender=Masc|Number=Plur	_	_	_	_
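
Before handing a partially filled file to OldSlavNet, it can be useful to check that every token line really has 10 tab-separated columns and to see which columns already carry values. The following is a minimal sketch of such a check, not part of OldSlavNet itself; the function names `read_conllu` and `filled_columns` are hypothetical, and it assumes comment lines start with `#` and sentences are separated by blank lines, as in the examples above.

```python
def read_conllu(lines):
    """Yield sentences as lists of 10-field token rows, skipping comment lines."""
    sentence = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (source, text, sent_id, ...) are skipped.
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(
                f"line {line_no}: expected 10 columns, got {len(fields)}"
            )
        sentence.append(fields)
    if sentence:
        yield sentence


def filled_columns(sentence):
    """Return the 1-based numbers of columns 3-10 that hold a value other than '_'."""
    return sorted(
        {i + 1 for row in sentence for i, value in enumerate(row)
         if i >= 2 and value != "_"}
    )


# Two token lines taken from the partially filled example above.
sample = [
    "# sent_id = 195775\n",
    "1\tво\tвъ\t_\t_\t_\t_\t_\t_\t_\n",
    "2\tиндѣискои\tиндѣискыи\t_\t_\t"
    "Case=Loc|Degree=Pos|Gender=Fem|Number=Sing|Strength=Weak\t_\t_\t_\t_\n",
    "\n",
]
sentences = list(read_conllu(sample))
print(filled_columns(sentences[0]))
```

For the sample, only columns 3 (lemma) and 6 (morphological features) are reported as filled. To check a whole file, pass an open file object instead of the list, e.g. `read_conllu(open("path/to/file.conllu", encoding="utf-8"))`.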

Convert a text file to CoNLL-U

Start from a plain-text file that has already been tokenized and split into sentences: one sentence per line, with tokens separated by whitespace and any punctuation mark treated as a separate token. For example:

да тѹтъ есми жил в чебокарѣ ѕ҃ мсць
да в сарѣ жил мсць в маздраньскои земли
а ѿтѹды ко амили
и тѹтъ жилъ есми мсць
а ѿтѹды к димовантꙋ
а из димовантѹ ко рѣю
а тѹ ꙋбили шаꙋсенѧ алеевых детеи и внѹчатъ махметевых .
и ѡнъ их проклѧлъ
ино о҃ городовъ сѧ розвалило
а изд рѣѧ к кашени
и тѹтъ есми был мсць .
а из кашени к наинѹ
а из наина ко ездѣи
и тѹтъ жилъ есми мсць
а из диесъ къ сырчанѹ

Assuming this text file is called text1 (do not add a .txt extension to the file name), run the following:

python3 ./scripts/preprocess/converter.py path/to/text1

This generates a text1.conllu file that is ready to be annotated by OldSlavNet.
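
To illustrate the mapping from the text format to the blank CoNLL-U format shown under Input format, here is a minimal sketch of such a conversion. It is not the code of converter.py, whose internals are not described here; the function name `text_to_conllu` is hypothetical, and the sketch simply assumes one sentence per line and whitespace-separated tokens.

```python
def text_to_conllu(text):
    """Turn one-sentence-per-line text into blank 10-column CoNLL-U."""
    blocks = []
    for line in text.splitlines():
        tokens = line.split()  # split on any run of whitespace
        if not tokens:
            continue  # skip empty lines
        rows = [
            # Column 1 is the token index, column 2 the word form;
            # the remaining eight columns are left as '_'.
            "\t".join([str(index), token] + ["_"] * 8)
            for index, token in enumerate(tokens, start=1)
        ]
        blocks.append("\n".join(rows))
    # Each sentence is followed by a blank line.
    return "".join(block + "\n\n" for block in blocks)


text = "да тѹтъ есми жил в чебокарѣ ѕ҃ мсць\nи ѡнъ их проклѧлъ\n"
conllu = text_to_conllu(text)
print(conllu, end="")
```

Token numbering restarts at 1 for every sentence, which matches the blank input example at the top of this page.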