Multilinguality in Language Modeling: Tasks, Data, and Opportunities for Typological Resources
- Authors: Shavrina T.O. (1), Kornilov A.A. (2)
- Affiliations:
  - (1) Institute of Linguistics of the Russian Academy of Sciences
  - (2) Higher School of Economics
- Issue: No 4(123) (2025)
- Pages: 122-135
- Section: PARALLELS BETWEEN NATURAL AND ARTIFICIAL INTELLIGENCE
- URL: https://ogarev-online.ru/2587-6090/article/view/367526
- DOI: https://doi.org/10.22204/2587-8956-2025-123-04-122-135
- ID: 367526
Abstract
This paper addresses the challenge of building language technologies for the world's under-resourced languages, which make up the majority of languages and lack the large text corpora and annotated datasets that modern machine learning requires. Although advances in Large Language Models (LLMs) have transformed machine translation and reading comprehension, these models often underperform or fail entirely on languages with limited written resources. We present an overview of current multilingual support in LLMs and evaluate how well they understand the primary knowledge source available for such languages: descriptive grammars. To make use of this structured but complex information, we propose a Retrieval-Augmented Generation (RAG) framework that enables models to accurately extract and interpret linguistic features from grammatical texts, facilitating downstream tasks such as machine translation. Our evaluation provides the first comprehensive assessment of language models' ability to interpret and extract linguistic features in context, covering grammatical descriptions of 248 languages from 142 language families. The analysis focuses on typological features from the WALS [1] and Grambank [2] databases. The result is a critical resource for scaling language technologies to under-resourced languages. Code and data from this study are publicly available: https://github.com/al-the-eigenvalue/RAG-on-grammars.
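The retrieve-then-prompt idea behind the RAG framework can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that): the bag-of-words cosine retriever stands in for whatever embedding model a real system would use, the grammar paragraphs are invented for illustration, and the function names `retrieve` and `build_prompt` are hypothetical. The feature question is modeled on WALS feature 81A (order of subject, object and verb).

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, paragraphs: list[str], k: int = 1) -> list[str]:
    """Return the k paragraphs most similar to the query (toy stand-in for a dense retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        paragraphs,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, options: list[str], context: list[str]) -> str:
    """Assemble a classification prompt from retrieved grammar excerpts."""
    excerpts = "\n".join(f"- {c}" for c in context)
    choices = ", ".join(options)
    return (
        f"Grammar excerpts:\n{excerpts}\n\n"
        f"Question: {question}\n"
        f"Answer with one of: {choices}"
    )


# Invented paragraphs standing in for a chunked descriptive grammar (assumption).
paragraphs = [
    "The language has twelve consonant phonemes and five vowels.",
    "In transitive clauses the subject precedes the object, and the verb comes last.",
    "Nouns are marked for plural number by the suffix -ra.",
]
question = "What is the order of subject, object and verb in a transitive clause?"
options = ["SOV", "SVO", "VSO", "VOS", "OVS", "OSV", "No dominant order"]

top = retrieve(question, paragraphs, k=1)
prompt = build_prompt(question, options, top)
print(prompt)  # this string would then be sent to a language model of choice
```

Passing only the retrieved excerpts, rather than the whole grammar, keeps the prompt short and focuses the model on the passage that bears on the feature in question.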
About the authors
T. O. Shavrina
Institute of Linguistics of the Russian Academy of Sciences
Author for correspondence.
Email: rybolos@gmail.com
Candidate of Philology, Senior Researcher
Russian Federation, Moscow
A. A. Kornilov
Higher School of Economics
Email: albert.kornilov801@gmail.com
Bachelor
Russian Federation, Moscow
References
- Dryer M.S., Haspelmath M. (eds.). WALS Online (v2020.4) [Data set]. Zenodo, 2013. doi: 10.5281/zenodo.13950591.
- Skirgård H., Haynie H., Passmore S. et al. Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss // Science Advances. 2023. Vol. 9. № 16. Article eadg6175. doi: 10.1126/sciadv.adg6175.
- Ebrahimi A. et al. Findings of the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages // Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP). Toronto, Canada: Association for Computational Linguistics, 2023. P. 206–219.
- Lovenia H. et al. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Miami, USA: Association for Computational Linguistics, 2024. P. 5155–5203.
- Nekoto W. et al. Participatory Research for Low-Resourced Machine Translation: A Case Study in African Languages // Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, 2020. P. 2144–2160.
- Winata G.I. et al. NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages // Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Dubrovnik, Croatia: Association for Computational Linguistics, 2023. P. 815–834.
- Bapna A. et al. Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983, 2022.
- Chen W. et al. Towards Robust Speech Representation Learning for Thousands of Languages // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Miami, USA: Association for Computational Linguistics, 2024. P. 10205–10224.
- Garrette D., Mielens J., Baldridge J. Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages // Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013). Vol. 1: Long Papers. Sofia, Bulgaria: Association for Computational Linguistics, 2013. P. 583–592.
- Tanzer G., Suzgun M., Visser E., Jurafsky D., Melas-Kyriazi L. A benchmark for learning to translate a new language from one grammar book. arXiv preprint arXiv:2309.16575, 2023.
- Muennighoff N., Tazi N., Magne L., Reimers N. MTEB: Massive Text Embedding Benchmark // Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Dubrovnik, Croatia: Association for Computational Linguistics, 2023. P. 2014–2037.
- Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Kiela D. Retrieval-augmented generation for knowledge-intensive NLP tasks // Advances in Neural Information Processing Systems (NeurIPS). 2020. Vol. 33. P. 9459–9474.
- Zhang K., Choi Y., Song Z., He T., Wang W.Y., Li L. Hire a Linguist!: Learning Endangered Languages in LLMs with In-Context Linguistic Descriptions // Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024. P. 15654–15669.
- Ponti E.M., Glavaš G., Majewska O., Liu Q., Vulić I., Korhonen A. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning // Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020. P. 2362–2376.
- Virk S.M., Foster D., Sheikh M.A., Saleem R. A Deep Learning System for Automatic Extraction of Typological Linguistic Information from Descriptive Grammars // Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021). Online: INCOMA Ltd., 2021. P. 1480–1489.
- Hammarström H., Her O.-S., Allassonnière-Tang M. Term spotting: A quick-and-dirty method for extracting typological features of language from grammatical descriptions // Selected Contributions from the Eighth Swedish Language Technology Conference (SLTC-2020). 2020. P. 27–34.
- Kornilov A. Multilingual Automatic Extraction of Linguistic Data from Grammars // Proceedings of the Second Workshop on NLP Applications to Field Linguistics. Dubrovnik, Croatia: Association for Computational Linguistics, 2023. P. 86–94.
- Kornilov A., Shavrina T. From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars. arXiv preprint arXiv:2411.15577, 2024.
- Miestamo M., Bakker D., Arppe A. Sampling for variety // Linguistic Typology. 2016. Vol. 20. № 2. P. 233–296.
- Cheveleva A. Neutralization of gender values in the plural. Bachelor’s thesis. Moscow: HSE University, 2023.
- Wei J., Wang X., Schuurmans D., Bosma M., Xia F., Chi E., Le Q.V., Zhou D. et al. Chain-of-thought prompting elicits reasoning in large language models // Advances in Neural Information Processing Systems (NeurIPS). 2022. Vol. 35. P. 24824–24837.
- Hammarström H., Forkel R., Haspelmath M., Bank S. (eds.). Glottolog 5.0 [Data set]. Zenodo, 2024. doi: 10.5281/zenodo.8635585.

