Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?
- Authors: Morozov D.A. [1], Garipov T.A. [1], Lyashevskaya O.N. [2, 3], Savchuk S.O. [3], Iomdin B.L. [4], Glazkova A.V. [5]
- Affiliations:
  - [1] Novosibirsk State University
  - [2] HSE University
  - [3] Vinogradov Russian Language Institute
  - [4] independent researcher
  - [5] University of Tyumen
- Issue: Vol. 10, No. 4 (2024)
- Pages: 71-84
- Section: Research Papers
- URL: https://ogarev-online.ru/2411-7390/article/view/356610
- DOI: https://doi.org/10.17323/jle.2024.22237
- ID: 356610
Abstract
Purpose: To compare existing morpheme segmentation algorithms for Russian and to assess whether they can be used to augment existing morpheme dictionaries automatically.
Results: In this study, we compared several state-of-the-art machine learning algorithms using three datasets structured around different segmentation paradigms. Two experiments were carried out, each employing five-fold cross-validation. In the first experiment, we randomly partitioned the dataset into five subsets. In the second, we excluded words that contained multiple roots and assigned all words sharing a root to the same subset, so that each root occurred in only one subset. During cross-validation, models were trained on four of these subsets and evaluated on the remaining one. Across both experiments, the algorithms that relied on ensembles of convolutional neural networks consistently demonstrated the highest performance. However, we observed a notable decline in accuracy when testing on words containing unfamiliar roots. We also found that, on a randomly selected set of words, the performance of these algorithms was comparable to that of human experts.
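The two splitting schemes described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the greedy balancing of root families, and the toy word list are assumptions made for illustration only, and the abstract does not state how the subsets were balanced.

```python
import random
from collections import defaultdict

# Toy data (hypothetical): word -> list of its roots.
TOY_WORD_ROOTS = {
    "лесной": ["лес"], "лесник": ["лес"],
    "водный": ["вод"], "подводный": ["вод"],
    "домик": ["дом"], "домашний": ["дом"],
    "столик": ["стол"], "настольный": ["стол"],
    "садовый": ["сад"], "садик": ["сад"],
    "лесовоз": ["лес", "воз"],   # two roots: excluded from the grouped split
    "водопад": ["вод", "пад"],   # two roots: excluded from the grouped split
}


def random_folds(words, k=5, seed=0):
    """First scheme: shuffle the words and deal them into k folds."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def root_grouped_folds(word_roots, k=5):
    """Second scheme: every word with a given root lands in the same fold.

    Words with more than one root are skipped. Balancing is an assumption:
    the largest root families are placed first, each into the currently
    smallest fold.
    """
    by_root = defaultdict(list)
    for word, roots in word_roots.items():
        if len(roots) == 1:
            by_root[roots[0]].append(word)
    folds = [[] for _ in range(k)]
    families = sorted(by_root.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _root, family in families:
        min(folds, key=len).extend(family)
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs: train on k-1 folds, test on the remaining one."""
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in cross_validation_splits(root_grouped_folds(TOY_WORD_ROOTS)):
        print(len(train), "train words; test fold:", test)
```

With the grouped split, no root in a test fold has been seen during training, which is the condition under which the abstract reports a notable drop in accuracy.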
Conclusion: Our results indicate that although automatic methods have, on average, reached near-expert quality, they do not take semantics into account and therefore cannot be used for automatic dictionary expansion without expert validation. Further research should address the two key issues identified: poor performance on words with unknown roots and on acronyms. When the test dataset can be assumed to contain only a small number of unfamiliar roots, an ensemble of convolutional neural networks should be used. The presented results can be used in the development of morpheme-oriented tokenizers and systems for analyzing text complexity.
About the authors
Dmitry Morozov
Novosibirsk State University
Email: morozowdm@gmail.com
ORCID ID: 0000-0003-4464-1355
Novosibirsk, Russia
Timur Garipov
Novosibirsk State University
Email: t.garipov@g.nsu.ru
ORCID ID: 0009-0008-4527-2268
Novosibirsk, Russia
Olga Lyashevskaya
HSE University; Vinogradov Russian Language Institute
Email: olesar@yandex.ru
ORCID ID: 0000-0001-8374-423X
Moscow, Russia; Russian Academy of Sciences, Moscow, Russia
Svetlana Savchuk
Vinogradov Russian Language Institute
Email: savsvetlana@mail.ru
ORCID ID: 0000-0003-0464-7269
Russian Academy of Sciences, Moscow, Russia
Boris Iomdin
independent researcher
Email: lingnarod@gmail.com
ORCID ID: 0000-0002-1767-5480
Anna Glazkova
University of Tyumen
Email: a.v.glazkova@utmn.ru
ORCID ID: 0000-0001-8409-6457
Tyumen, Russia


