Parallel Bibles

Question

How do the world's languages encode situations which English expresses with when-subordinates? How do we explain cases in which both language x and language y have a literal translation for when, but they cannot always use it to translate the when-counterpart in the other language? Can we find a finite number of 'patterns' among the world's (7000+!) languages within which variation in usage can occur?
Challenge

Traditionally, we rely on a corpus of parallel sentences in n languages, each manually annotated for the linguistic categories of interest.
1. How do we scale up to many more than the handful of languages that one could manually annotate?
2. How do we use a corpus to investigate a phenomenon (e.g. when-clauses worldwide) in a data-driven way, i.e. without prior hypotheses constraining the type of outcomes we could get (as would be the case if we annotated all when-clauses with a predetermined set of semantic tags)?
3. It would be impossible for an individual researcher to be familiar with all the languages in a very large dataset. Can we also discover patterns and group language types among languages with which the researcher is not familiar?
Data

The Bible is the most widely translated book in history. To scale up to as many languages as currently possible, we used Mayer & Cysouw's (2014) 'massively parallel' Bible corpus, which contains the New Testament in over 1400 languages. To add some representation of ancient languages as well, we also included the Latin, Classical Armenian, Gothic, and Old Church Slavonic translations of the New Testament from Haug & Jøhndal's (2008) Pragmatic Resources in Old Indo-European Languages (PROIEL) treebanks. Several of the 1400+ languages in the corpus have multiple translations, and some contain only (or predominantly) the Old Testament. Only languages with at least some coverage of the New Testament were considered. The dataset creation process consisted of the following steps:
1. Filter out languages that only have the Old Testament.
2. Include all other languages with only one translation of the New Testament.
3. Of all languages with multiple New Testament versions, include the version with the broadest coverage, provided that it exceeds the others by more than 2000 verses.
4. If multiple versions with a similar coverage exist, then take the most recent one.
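The four steps above can be sketched as a small selection function. This is only an illustration under stated assumptions: the `Version` record, its field names, and the reading of step 3 (versions count as having "similar coverage" when they are within 2000 verses of the broadest one) are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass
from typing import List, Optional

COVERAGE_GAP = 2000  # verse threshold mentioned in step 3


@dataclass
class Version:
    name: str       # hypothetical identifier of a translation
    nt_verses: int  # number of New Testament verses it covers
    year: int       # publication year


def select_version(versions: List[Version]) -> Optional[Version]:
    """Pick one translation for a language, following steps 1-4."""
    # Step 1: drop versions without any New Testament coverage.
    candidates = [v for v in versions if v.nt_verses > 0]
    if not candidates:
        return None
    # Step 2: a single remaining version is taken as is.
    if len(candidates) == 1:
        return candidates[0]
    # Step 3: keep only versions within COVERAGE_GAP verses of the broadest one.
    best = max(v.nt_verses for v in candidates)
    similar = [v for v in candidates if best - v.nt_verses <= COVERAGE_GAP]
    # Step 4: among versions with similar coverage, take the most recent
    # (ties on year are broken by coverage).
    return max(similar, key=lambda v: (v.year, v.nt_verses))


# Toy input with invented numbers, for illustration only.
toy = [Version("A", 7900, 1950), Version("B", 7000, 2005), Version("C", 3000, 2015)]
print(select_version(toy).name)  # prints B
```

In the toy input, version C is the most recent but is dropped because it trails the broadest version by more than 2000 verses; B is then preferred over A because it is newer.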
(Pre-)processing and alignment at token-level

For each of the target languages, a word-alignment model was trained using Junczys-Dowmunt & Szał's (2012) SyMGIZA++, after testing its performance against that of other alignment algorithms, such as GIZA++ and FastText. Minimal preprocessing (lowercasing and punctuation removal) was applied before training.
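A minimal sketch of the preprocessing step (lowercasing and punctuation removal) might look as follows. The function name `preprocess` is hypothetical, and treating every character in a Unicode punctuation category as removable is an assumption; the exact rules used in the project are not stated here.

```python
import unicodedata


def preprocess(verse: str) -> str:
    """Lowercase a verse and strip punctuation before alignment training."""
    lowered = verse.lower()
    # Assumption: "punctuation" means any Unicode category starting with "P",
    # which also covers non-ASCII marks such as the Spanish inverted question mark.
    kept = "".join(ch for ch in lowered if not unicodedata.category(ch).startswith("P"))
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(kept.split())


print(preprocess("And when he had said this, he fell asleep."))
# prints: and when he had said this he fell asleep
```

Note that this rule also removes apostrophes and hyphens, which some orthographies use inside words, so a real pipeline may need language-specific exceptions.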
Similarity matrix and dimensionality reduction

After extracting when and its parallels in all the target languages, Hamming distance was used to measure the dissimilarity between pairs of contexts, based on how often a language uses two different linguistic means where English uses a single one. Multidimensional scaling (MDS) was then applied to the resulting matrix to visualize the distance between the occurrences of when in two and three dimensions. The further apart two points are on the map, the more different the usage or meaning of the respective when-clauses can be assumed to be. Below is the 3D example for Spanish. Toggle individual words on or off to include or exclude them from the map, hover over the data points to see the English-Spanish parallel contexts, and drag the plot to view it from different perspectives.
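The distance-plus-MDS step can be illustrated with a small sketch. Several choices below are assumptions rather than the project's actual settings: each context is represented as a list with one aligned form per language, languages with a missing alignment (`None`) are skipped and the distance is normalized by the number of compared languages, and classical (Torgerson) MDS is used because the MDS variant is not specified above. The forms `w1`, `x1`, and so on are invented placeholders.

```python
import numpy as np


def hamming_matrix(contexts):
    """Pairwise normalized Hamming distance between contexts.

    contexts[i][j] is the form aligned to 'when' in context i, language j,
    or None when no alignment is available (skipped, by assumption).
    """
    n = len(contexts)
    dist = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            pairs = [(x, y) for x, y in zip(contexts[a], contexts[b])
                     if x is not None and y is not None]
            d = sum(x != y for x, y in pairs) / len(pairs) if pairs else 0.0
            dist[a, b] = dist[b, a] = d
    return dist


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS via double centering and eigendecomposition."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Hamming distances need not be Euclidean, so negative eigenvalues are clipped.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Four toy contexts over three toy languages.
contexts = [
    ["w1", "x1", "y1"],
    ["w1", "x1", "y2"],
    ["w2", "x2", "y2"],
    ["w2", "x2", None],
]
dist = hamming_matrix(contexts)
coords = classical_mds(dist, n_components=2)
print(dist.round(2))
print(coords.shape)
```

Each row of `coords` is one occurrence of when; contexts that the toy languages render with the same forms end up close together on the resulting map.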
From individual data points to universals

Clusters of semantically similar observations were identified in two main ways. The first is to fit a Gaussian Mixture Model (GMM) to the MDS matrix. This helps us identify clusters that are more likely to correspond to separate universal functions of when, regardless of how much variation a particular language shows within any of the clusters (which can range from no variation across the whole map or across one cluster to several linguistic means within a single cluster).
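As a rough sketch of this step, a GMM can be fitted to two-dimensional coordinates with scikit-learn. The coordinates below are synthetic stand-ins for the MDS output (three invented blobs), and the number of components and covariance type are illustrative assumptions, not the project's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the first two MDS dimensions: three toy blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
coords = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit the mixture and read off hard labels and soft memberships.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(coords)    # one cluster index per observation
probs = gmm.predict_proba(coords)   # posterior probability of each cluster

print(np.bincount(labels))
print(probs[:3].round(3))
```

The soft memberships in `probs` are useful here because the temporal constructions compete with each other: an observation near a cluster boundary has its probability spread over more than one cluster.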
The number of clusters in the GMM is chosen with the help of different heuristics. In the graphs below we can see that the optimal number (based on the first two dimensions of the MDS matrix) is 3, although in our methodology elbow methods must be applied more flexibly than in classic classification tasks. These methods are meant to indicate how many clusters can be considered maximally separate from each other. However, empirically, we know that the temporal constructions under consideration are always competing with each other to some extent and that their scopes are not at all clear-cut. In fact, for at least two of the three methods (the Silhouette and Davies-Bouldin scores), we can see that it is only after 6 clusters that the clustering becomes more clearly overfitted to the data, while the third method (the Distortion score) does not clearly suggest so until 9 (the slope becomes less steep, but does not rise).
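The three heuristics can be computed for a range of cluster counts along the following lines. This is a sketch on synthetic data (three invented blobs standing in for the MDS coordinates), and the distortion score is assumed here to be the sum of squared distances from each point to the mean of its assigned cluster.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the first two MDS dimensions: three toy blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
coords = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])


def distortion(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


scores = {}
for k in range(2, 8):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(coords)
    if len(np.unique(labels)) < 2:
        continue  # both sklearn scores need at least two non-empty clusters
    scores[k] = {
        "silhouette": silhouette_score(coords, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(coords, labels),  # lower is better
        "distortion": distortion(coords, labels),                # look for an elbow
    }

best_by_silhouette = max(scores, key=lambda k: scores[k]["silhouette"])
for k, s in scores.items():
    print(k, {name: round(float(value), 3) for name, value in s.items()})
```

On clean toy blobs the scores agree on a single value of k. On real semantic maps, where the clusters overlap, the curves are flatter and are better read as a plausible range than as a single answer.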
From universals to language-internal variation

Starting once again from the MDS matrix, we can apply Kriging, an interpolation method that uses a limited set of sampled data points (each observation in the target languages) to estimate the value of a variable at an unsampled location. Unlike the GMM clustering, which predicts distinct clusters regardless of individual languages, Kriging helps identify areas that are likely to be the domain of a specific linguistic means in a particular language. The degree to which an area identified by Kriging overlaps with a GMM-identified cluster can tell us whether that particular language has a dedicated means for the function of that cluster, whether it uses more than one means, whether one means extends over more than one cluster, or whether it has no dedicated means at all. As a minimal example, compare the maps for Norwegian and Kako below: the former co-lexicalizes the clusters corresponding to Kako ma and ŋgimɔ, which in turn correspond to separate clusters in any of the Gaussian Mixture Model plots with k > 3.
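To make the interpolation idea concrete, here is a minimal ordinary-kriging sketch in NumPy. It is not the project's implementation: the Kriging variant is not specified above, so the exponential variogram with fixed sill and range (in practice these would be fitted to the data) and the 0/1 indicator coding of "this language uses marker A here" are assumptions, and the coordinates are invented.

```python
import numpy as np


def exp_variogram(h, sill=1.0, range_=1.0):
    """Exponential variogram model without nugget (assumed, not fitted)."""
    return sill * (1.0 - np.exp(-h / range_))


def ordinary_kriging(points, values, targets, variogram=exp_variogram):
    """Estimate values at `targets` from sampled `points` by ordinary kriging."""
    n = len(points)
    # Variogram between all pairs of sampled points, bordered by the
    # unbiasedness constraint (weights must sum to one).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = variogram(d)
    system[n, n] = 0.0
    # Variogram between sampled points and each target location.
    d0 = np.linalg.norm(targets[:, None, :] - points[None, :, :], axis=-1)
    rhs = np.ones((n + 1, len(targets)))
    rhs[:n, :] = variogram(d0).T
    weights = np.linalg.solve(system, rhs)[:n]  # drop the Lagrange multiplier
    return weights.T @ values


# Toy map: 1.0 where a language uses marker A, 0.0 where it uses another means.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
values = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
targets = np.array([[0.3, 0.3], [4.3, 4.3]])

estimates = ordinary_kriging(points, values, targets)
print(estimates.round(3))
```

Thresholding such estimates (for instance at 0.5) over a grid of target locations yields an area for marker A, which can then be compared with the GMM cluster boundaries.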
More results

Curious about how maps such as the Norwegian and Kako ones above were used to analyse all the other languages in the corpus typologically? Check out the article 'The semantic map of when and its typological parallels' (Haug & Pedrazzini 2023), co-written with Dag Haug.
Want to know how we could automatically detect and represent variation among constructions encoded at the morphological, subtoken level in the semantic maps? Check out my experiments on when-clauses and switch-reference morphology, described in the article 'Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology' (Pedrazzini 2024).
Credits

This is a project in collaboration with Prof. Dag Haug (Oslo).
References

Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 27–34.
Haug, Dag & Nilo Pedrazzini. 2023. The semantic map of when and its typological parallels. Frontiers in Communication 8.
Junczys-Dowmunt, Marcin & Arkadiusz Szał. 2012. Symgiza++: Symmetrized word alignment models for machine translation. In Pascal Bouvry, Mieczyslaw A. Klopotek, Franck Leprévost, Malgorzata Marciniak, Agnieszka Mykowiecka & Henryk Rybinski (eds.), Security and Intelligent Information Systems (SIIS) 7053 (Lecture Notes in Computer Science), 379–390. Warsaw, Poland: Springer.
Mayer, Thomas & Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 3158–3163. Reykjavik, Iceland: European Language Resources Association (ELRA).
Pedrazzini, Nilo. 2024. Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), 24–33. Mexico City, Mexico: Association for Computational Linguistics.