How do the languages of the world encode situations which English expresses with when-subordinates? How do we explain cases in which both language x and language y have a literal translation for when, but it cannot always be used to translate the one in the other language? Can we find a finite number of 'patterns' among the world's (7000+!) languages within which variation in usage can occur?
Traditionally, we rely on a corpus of parallel sentences in n languages, each manually annotated for the linguistic categories of interest. This approach raises several questions:
- How can we scale up to many more than the handful of languages that one could manually annotate?
- How do we use a corpus to investigate a phenomenon (e.g. when-clauses worldwide) in a data-driven way, i.e. without prior hypotheses constraining the type of outcomes we could get (as would be the case if we annotated all when-clauses with a pre-determined set of semantic tags)?
- It would be impossible for an individual researcher to be familiar with all the languages in a very large dataset. Can we discover patterns and group language types even among languages with which the researcher is not familiar?
The Bible is the most widely translated book in history. To scale up to as many languages as currently possible, I used Mayer & Cysouw's (2014) 'massively parallel' Bible corpus, which contains the New Testament in more than 1400 languages. To add some representation of ancient languages as well, I have also included the Latin, Classical Armenian, Gothic, and Old Church Slavonic translations of the New Testament from Haug & Jøhndal's (2008) Pragmatic Resources in Old Indo-European Languages (PROIEL) treebank. Several of the roughly 1400 languages in the corpus have multiple translations, and some contain only (or predominantly) the Old Testament. Only languages that had at least some coverage of the New Testament were considered. The dataset creation process consisted of the following steps:
- Filter out languages that only have the Old Testament.
- Include all other languages with only one translation of the New Testament.
- For languages with multiple New Testament versions, include the version with the broadest coverage (when it exceeds the coverage of the other versions by more than 2000 verses).
- If multiple versions with a similar coverage exist, then take the most recent one.
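The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Translation` record, its fields, the language codes, and the reading of "similar coverage" as "within 2000 verses of the broadest version" are all hypothetical and not taken from the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Translation:
    # Hypothetical record; the field names are illustrative only.
    language: str
    name: str
    year: int
    nt_verses: int  # number of New Testament verses present


def select_versions(translations, margin=2000):
    """Pick one New Testament version per language."""
    by_language = {}
    for t in translations:
        # Step 1: drop versions without any New Testament coverage.
        if t.nt_verses > 0:
            by_language.setdefault(t.language, []).append(t)

    selected = {}
    for language, versions in by_language.items():
        broadest = max(v.nt_verses for v in versions)
        # Assumption: versions within `margin` verses of the broadest
        # one count as having "similar coverage".
        similar = [v for v in versions if broadest - v.nt_verses <= margin]
        # Among similar versions, take the most recent one. A language
        # with a single version trivially keeps that version.
        selected[language] = max(similar, key=lambda v: v.year)
    return selected


# Toy data with made-up language codes and version names.
translations = [
    Translation("xxa", "A-only-OT", 1990, 0),
    Translation("xxb", "B-single", 1985, 7900),
    Translation("xxc", "C-old-full", 1950, 7950),
    Translation("xxc", "C-new-partial", 2005, 3000),
    Translation("xxd", "D-old", 1960, 7900),
    Translation("xxd", "D-new", 2010, 7500),
]
selected = select_versions(translations)
```

In this toy run the language without New Testament coverage is dropped, the much broader older version wins for `xxc`, and the more recent version wins for `xxd` because the two versions have similar coverage.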
(Pre-)processing and alignment at token-level
For each of the target languages, a word-alignment model was trained using Junczys-Dowmunt & Szał's (2012) SyMGIZA++, after testing its performance against that of other alignment algorithms, such as GIZA++ and FastText. Minimal preprocessing (lowercasing and punctuation removal) was applied before training.
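The minimal preprocessing could look roughly like the sketch below. The exact rules used in the project are not specified here, so the function name `preprocess` and the choice to treat every Unicode punctuation character as removable are assumptions.

```python
import unicodedata


def preprocess(verse):
    """Lowercase a verse and strip punctuation before word alignment."""
    lowered = verse.lower()
    # Drop every character whose Unicode category is a punctuation class
    # (Pc, Pd, Ps, Pe, Pi, Pf, Po); this also covers non-ASCII marks.
    kept = "".join(
        ch for ch in lowered
        if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any whitespace runs left behind by removed characters.
    return " ".join(kept.split())


cleaned = preprocess("And when he came, he said: «Peace!»")
```

Relying on Unicode categories rather than an ASCII punctuation list keeps the same function usable across scripts. Note that it also removes apostrophes and hyphens inside words, which may or may not be desirable for a given language.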
Similarity matrix and dimensionality reduction
After extracting when and its parallels in all the target languages, Hamming distance was used as a measure of dissimilarity between pairs of contexts, leveraging the number of times a language uses two different linguistic means where English uses one. Multidimensional scaling (MDS) was then applied to the resulting matrix to visualize the distances between occurrences of when in two and three dimensions. Below is the 3D example for Spanish. Toggle individual words on or off to include or exclude them from the map, hover over the data points to see the English-Spanish parallel contexts, and drag the plot to see it from different perspectives:
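As a rough illustration of this step, the sketch below computes a normalized Hamming distance between contexts and embeds the result with classical (Torgerson) MDS. Several things here are assumptions rather than details of the project: the MDS variant, the normalization by the number of languages attested in both contexts, the treatment of missing alignments, and the placeholder markers in the toy data.

```python
import numpy as np


def hamming_matrix(rows):
    """Pairwise share of languages whose markers differ between two contexts.

    `rows[i][l]` is the marker aligned to 'when' in context i and language l,
    or None when no alignment is available.
    """
    n = len(rows)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Compare only languages attested in both contexts.
            pairs = [(a, b) for a, b in zip(rows[i], rows[j])
                     if a is not None and b is not None]
            if pairs:
                d = sum(a != b for a, b in pairs) / len(pairs)
            else:
                d = 1.0  # assumption: no shared evidence = maximally distant
            dist[i, j] = dist[j, i] = d
    return dist


def classical_mds(dist, n_components=2):
    """Classical MDS: eigendecomposition of the double-centred squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can occur for non-Euclidean distances; clip them.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy data: 4 contexts x 3 languages, with placeholder marker names.
rows = [
    ["a1", "b1", "c1"],
    ["a1", "b1", "c2"],
    ["a2", "b2", "c2"],
    ["a2", "b2", None],
]
dist = hamming_matrix(rows)
coords = classical_mds(dist, n_components=2)
```

Each row of `coords` is one occurrence of when placed on the map, so contexts that most languages express with the same means end up close together.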
From individual data points to universals
Clusters of semantically similar observations were identified in two main ways. First, by fitting a Gaussian Mixture Model (GMM) to the MDS matrix, to identify clusters which are more likely to correspond to separate universal functions of when, regardless of how much variation a particular language shows within any of the clusters (which could range from no variation across the whole map or across one cluster, to several linguistic means in a single cluster).
The number of clusters in the GMM is chosen with the help of different heuristics. In the graphs below we can see that the optimal number (based on the first two dimensions of the MDS matrix) is 3, although in our methodology elbow methods must be employed more flexibly than in classic classification tasks. These methods are meant to indicate how many clusters can be considered maximally separate from each other. However, empirically, we know that the temporal constructions under consideration are always competing with each other to some extent and that their scopes are not at all clear-cut. In fact, for at least two of the three methods (the Silhouette and Davies-Bouldin scores), we can see that it is only after 6 that the clustering would become more clearly overfitted to the data, while a third method (Distortion score) does not even clearly suggest so until 9 (the slope becomes less steep, but does not rise!).
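A comparison of this kind can be sketched with scikit-learn as follows. The helper name `score_cluster_counts`, the candidate range, and the synthetic three-blob data are assumptions made for illustration; the distortion score is left out, and none of the numbers this produces relate to the project's actual results.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture


def score_cluster_counts(coords, candidates=range(2, 7), seed=0):
    """Fit one GMM per candidate k and record two separation heuristics."""
    results = {}
    for k in candidates:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(coords)
        labels = gmm.predict(coords)
        if len(set(labels)) < 2:
            continue  # both scores need at least two non-empty clusters
        results[k] = {
            # Higher is better (range -1 to 1).
            "silhouette": silhouette_score(coords, labels),
            # Lower is better (0 is the minimum).
            "davies_bouldin": davies_bouldin_score(coords, labels),
        }
    return results


# Synthetic stand-in for the first two MDS dimensions: three loose blobs.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
coords = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

scores = score_cluster_counts(coords)
```

Reading the two scores side by side for each k, instead of taking a single optimum, matches the flexible use of these heuristics described above.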
From universals to language-internal variation
Starting once again from the MDS matrix, we can apply Kriging, an interpolation method that uses a limited set of sampled data points (each observation in the target languages) to estimate the value of a variable at an unsampled location. Unlike the GMM clustering, which predicts distinct clusters regardless of individual languages, Kriging helps identify areas that are likely to be the domain of a specific linguistic means in a particular language. The degree to which an area identified by Kriging overlaps with a GMM-identified cluster can tell us whether that particular language has a dedicated means for the function of that cluster, whether it uses more than one means, whether one means extends over more than one cluster, or whether it has no dedicated means at all. As a minimal example, compare the maps for Norwegian and Kako below: the former co-lexicalizes the clusters corresponding to Kako ma and ŋgimɔ, which in turn correspond to separate clusters in any of the Gaussian Mixture Model plots with k > 3.
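To make the interpolation idea concrete, here is a minimal sketch of ordinary Kriging on a 0/1 indicator (whether a given linguistic means is used at a data point). It is not the project's implementation: the exponential covariance model, the `range_`, `sill` and `nugget` values, and the toy coordinates are assumptions chosen for illustration, and a real analysis would fit the covariance to the data.

```python
import numpy as np


def ordinary_kriging(coords, values, targets, range_=1.0, sill=1.0, nugget=1e-6):
    """Estimate values at `targets` from samples at `coords`.

    Uses an exponential covariance model with assumed (not fitted) parameters.
    Returns the estimates and the weight given to each sample per target.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = len(coords)

    def cov(a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return sill * np.exp(-d / range_)

    # Kriging system: sample covariances plus a Lagrange row and column
    # that force the weights for each target to sum to one.
    system = np.zeros((n + 1, n + 1))
    system[:n, :n] = cov(coords, coords) + nugget * np.eye(n)
    system[:n, n] = 1.0
    system[n, :n] = 1.0

    rhs = np.ones((n + 1, len(targets)))
    rhs[:n, :] = cov(coords, targets)

    solution = np.linalg.solve(system, rhs)
    weights = solution[:n, :]  # one column of weights per target
    return weights.T @ values, weights


# Toy map: three points where a marker is used (1), three where it is not (0).
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [2.0, 2.0], [2.2, 1.9], [1.9, 2.1]])
uses_marker = np.array([1, 1, 1, 0, 0, 0])
targets = np.array([[0.1, 0.1], [2.0, 2.0]])

estimates, weights = ordinary_kriging(points, uses_marker, targets)
```

Evaluating the estimate over a regular grid and drawing a contour at a chosen threshold (for instance 0.5) would give an area that can then be compared with the GMM clusters.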
This is a project in collaboration with Prof. Dag Haug (Oslo).
Mayer, Thomas & Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 3158–3163. Reykjavik, Iceland: European Language Resources Association (ELRA).
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 27–34.
Junczys-Dowmunt, Marcin & Arkadiusz Szał. 2012. Symgiza++: Symmetrized word alignment models for machine translation. In Pascal Bouvry, Mieczyslaw A. Klopotek, Franck Leprévost, Malgorzata Marciniak, Agnieszka Mykowiecka & Henryk Rybinski (eds.), Security and Intelligent Information Systems (SIIS) 7053 (Lecture Notes in Computer Science), 379–390. Warsaw, Poland: Springer.