Train a new model

Requirements

To train your own parser, you will need:

  • train.conllu: the training data. Put this file inside the folder ./training_data/new/
  • dev.conllu: the development data. Put this file inside the folder ./training_data/new/
  • (Optional) validation and gold data: the training script works without these, but it is good practice to test the parser on a small batch of unseen data representative of the texts you intend to parse. Put each ‘gold-standard’ text you want to evaluate inside the folder ./test_data/gold/, and a copy of each, under the same file name, inside ./test_data/tobeannotated/
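
The file placement described above can be sketched as a small shell helper. This is only an illustration, not part of the project: the function name stage_data and its argument order are hypothetical, and it assumes you run it from the ROOT directory.

```shell
# Hypothetical helper (not shipped with the project): copy the training,
# development, and optional gold files into the folders listed above.
# Usage: stage_data <train file> <dev file> [gold file ...]
stage_data() {
    train_src=$1
    dev_src=$2
    shift 2

    mkdir -p training_data/new test_data/gold test_data/tobeannotated || return 1

    # The training script expects these two exact file names.
    cp "$train_src" training_data/new/train.conllu || return 1
    cp "$dev_src" training_data/new/dev.conllu || return 1

    # Each gold text goes into gold/ and, as a same-named copy,
    # into tobeannotated/. With no gold files the loop does nothing.
    for gold in "$@"; do
        cp "$gold" test_data/gold/ || return 1
        cp "$gold" test_data/tobeannotated/ || return 1
    done
}
```

It could be called as stage_data my_train.conllu my_dev.conllu text1.conllu text2.conllu, where the first two file names are placeholders for wherever your data currently lives.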

After adding these files to the right directories, the project directory should look like the following (note the placement of train.conllu, dev.conllu, and of the gold-test data, named text1.conllu and text2.conllu):

📦ROOT
 ┣ 📂models
 ┃ ┗ 📂OldSlavNet
 ┃   ┣ 📜model
 ┃   ┗ 📜model.params
 ┣ 📂oldslavnet-venv
 ┣ 📂scripts
 ┣ 📂test_data
 ┃ ┣ 📂annotated
 ┃ ┣ 📂gold
 ┃ ┃ ┣ 📜text1.conllu
 ┃ ┃ ┗ 📜text2.conllu
 ┃ ┗ 📂tobeannotated
 ┃   ┣ 📜text1.conllu
 ┃   ┗ 📜text2.conllu
 ┣ 📂training_data
 ┃ ┣ 📂new
 ┃ ┃ ┣ 📜dev.conllu
 ┃ ┃ ┗ 📜train.conllu
 ┃ ┗ 📂past
 ┃   ┗ 📂OldSlavNet
 ┃     ┣ 📜dev.conllu
 ┃     ┗ 📜train.conllu
 ┣ 📜LICENSE
 ┣ 📜Makefile
 ┣ 📜README.md
 ┣ 📜requirements.txt
 ┣ 📜tag.sh
 ┗ 📜train.sh
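
Before training, it may help to confirm that the layout matches the tree above. The following is a minimal sketch of such a check, assuming it is run from the ROOT directory; check_layout is a hypothetical name and is not provided by the project.

```shell
# Hypothetical pre-flight check (not shipped with the project).
# Returns 0 if the required files are in place, 1 otherwise.
check_layout() {
    status=0

    # train.conllu and dev.conllu are required.
    for f in training_data/new/train.conllu training_data/new/dev.conllu; do
        if [ ! -f "$f" ]; then
            echo "missing required file: $f" >&2
            status=1
        fi
    done

    # Gold texts are optional, but each one needs a same-named copy
    # in tobeannotated/.
    for gold in test_data/gold/*.conllu; do
        [ -e "$gold" ] || continue   # the glob matched nothing
        name=$(basename "$gold")
        if [ ! -f "test_data/tobeannotated/$name" ]; then
            echo "no copy of $name in test_data/tobeannotated/" >&2
            status=1
        fi
    done

    return "$status"
}
```

After defining the function, check_layout && echo "layout looks fine" prints the message only when nothing is missing.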

Train the model

From the ROOT directory, run:

./train.sh

You will be prompted to enter:

  1. A name for your new model (e.g. newmodel1)
  2. Your choice of hyperparameters (press Enter at any prompt to keep its default value):
    • training epochs
    • number of BiLSTM dimensions
    • number of BiLSTM layers
    • size of MLP hidden layer
    • size of word embeddings
    • size of character embeddings
    • size of POS tag embeddings
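
If you want to accept every default without typing, the answers could in principle be prepared in advance. The sketch below rests on an unverified assumption: that train.sh reads its prompts from standard input, in the order listed above (one model name, then seven hyperparameters). The function default_answers is hypothetical; check train.sh itself before relying on this.

```shell
# Hypothetical helper (not shipped with the project).
# ASSUMPTION: train.sh reads its answers from standard input, in the
# order listed above. This has not been verified against the script.
# Prints the model name followed by seven empty lines, one "Enter"
# per hyperparameter prompt.
default_answers() {
    printf '%s\n' "$1"        # answer to prompt 1: the model name
    i=0
    while [ "$i" -lt 7 ]; do  # seven hyperparameter prompts
        printf '\n'
        i=$((i + 1))
    done
}

# Example invocation (not run here):
#   default_answers newmodel1 | ./train.sh
```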

This will:

  • Create a folder under ./models/ named after your model, where the trained model (the model and model.params files) will be saved
  • Move your training and development data from ./training_data/new/ to a new folder under ./training_data/past/, also named after your model
  • Train your model
  • Annotate the texts in ./test_data/tobeannotated/, compare each with the file of the same name under ./test_data/gold/, and write a text file of performance metrics for each under ./models/yourmodelname/validation-output/

After the model has been trained, the project directory should look like the following:

📦ROOT
 ┣ 📂models
 ┃ ┣ 📂yourmodelname
 ┃ ┃ ┣ 📜model
 ┃ ┃ ┣ 📜model.params
 ┃ ┃ ┗ 📂validation-output
 ┃ ┃   ┣ 📜text1-validated.txt
 ┃ ┃   ┗ 📜text2-validated.txt
 ┃ ┗ 📂OldSlavNet
 ┃   ┣ 📜model
 ┃   ┗ 📜model.params
 ┣ 📂oldslavnet-venv
 ┣ 📂scripts
 ┣ 📂test_data
 ┣ 📂training_data
 ┃ ┣ 📂new
 ┃ ┗ 📂past
 ┃   ┣ 📂yourmodelname
 ┃   ┃ ┣ 📜dev.conllu
 ┃   ┃ ┗ 📜train.conllu
 ┃   ┗ 📂OldSlavNet
 ┃     ┣ 📜dev.conllu
 ┃     ┗ 📜train.conllu
 ┣ 📜LICENSE
 ┣ 📜Makefile
 ┣ 📜README.md
 ┣ 📜requirements.txt
 ┣ 📜tag.sh
 ┗ 📜train.sh
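
To confirm that training produced everything shown in the tree above, a check along the following lines could be used. It is a sketch under two assumptions: it is run from the ROOT directory, and the gold texts are still in ./test_data/gold/ after training. The function check_outputs is hypothetical; the -validated.txt naming follows the tree above.

```shell
# Hypothetical post-training check (not shipped with the project).
# Usage: check_outputs <model name>
check_outputs() {
    model=$1
    status=0

    # The trained model files and the archived training data.
    for f in "models/$model/model" "models/$model/model.params" \
             "training_data/past/$model/train.conllu" \
             "training_data/past/$model/dev.conllu"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done

    # One metrics report per gold text,
    # e.g. text1.conllu -> text1-validated.txt
    for gold in test_data/gold/*.conllu; do
        [ -e "$gold" ] || continue   # no gold texts: nothing to check
        report="models/$model/validation-output/$(basename "$gold" .conllu)-validated.txt"
        if [ ! -f "$report" ]; then
            echo "missing: $report" >&2
            status=1
        fi
    done

    return "$status"
}
```

For a model named newmodel1, check_outputs newmodel1 returns a non-zero status and lists whatever is missing.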