Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation

Abstract

Background: Recent advancements in large language model (LLM) technologies have introduced powerful open-source instruction-tuned LLMs that match the text generation quality of leading models such as GPT-4. Although such models accelerate LLM adoption in sensitive-information environments, the lack of disclosed training data hinders replication and keeps these achievements exclusive to specific models.

Purpose: Given the multilingual nature of the latest generation of open-source LLMs, the benefits of training language-specific LLMs diminish, leaving computational efficiency as the sole guaranteed advantage of this computationally expensive procedure. This work aims to address the language-adaptation limitations posed by restricted access to high-quality instruction-tuning data, offering a more cost-effective adaptation pipeline.

Method: To tackle language-adaptation challenges, we introduce Learned Embedding Propagation (LEP), a novel method with lower training data requirements and minimal disruption of existing LLM knowledge. LEP employs an innovative embedding propagation technique that bypasses the need for instruction-tuning and directly integrates new language knowledge into any instruction-tuned LLM variant. Additionally, we developed Darumeru, a new benchmark for evaluating text generation robustness during training, specifically tailored for Russian adaptation.
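For illustration only, the minimal sketch below shows one plausible way an embedding swap of this kind could be carried out with the Hugging Face transformers API: embeddings learned on a vocabulary-adapted base model are copied into the corresponding instruction-tuned checkpoint. The model paths, the adapted tokenizer, and the row-for-row copy strategy are assumptions for this example, not the authors' released pipeline.

```python
# Hypothetical sketch of embedding propagation between sibling checkpoints.
# Paths and model names are placeholders, not artifacts from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ADAPTED = "path/to/llama-3-8b-russian-base"        # base model with embeddings retrained for the new vocabulary (assumed)
INSTRUCT = "meta-llama/Meta-Llama-3-8B-Instruct"        # off-the-shelf instruction-tuned variant

donor = AutoModelForCausalLM.from_pretrained(BASE_ADAPTED, torch_dtype=torch.bfloat16)
target = AutoModelForCausalLM.from_pretrained(INSTRUCT, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(BASE_ADAPTED)  # adapted (Russian-aware) tokenizer

# Resize the instruct model to the adapted vocabulary, then transplant the
# learned input and output embedding matrices from the adapted base model.
target.resize_token_embeddings(len(tokenizer))
with torch.no_grad():
    target.get_input_embeddings().weight.copy_(donor.get_input_embeddings().weight)
    target.get_output_embeddings().weight.copy_(donor.get_output_embeddings().weight)

target.save_pretrained("llama-3-8b-instruct-russian-lep")
tokenizer.save_pretrained("llama-3-8b-instruct-russian-lep")
```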

Results: We applied the LEP method to adapt LLaMa-3-8B and Mistral-7B for Russian, testing four different vocabulary adaptation scenarios. Evaluation demonstrates that LEP achieves competitive performance, comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct. Further improvements were observed through self-calibration and additional instruction-tuning steps, enhancing task-solving capabilities beyond those of the original models.

Conclusion: LEP offers a viable and efficient alternative to traditional language-specific instruction-tuning, significantly reducing the costs associated with language adaptation while maintaining or surpassing the performance benchmarks set by contemporary LLMs.

About the authors

Mikhail Tikhomirov

Lomonosov Moscow State University

Email: tikhomirov.mm@gmail.com
ORCID ID: 0000-0001-7209-9335
Moscow, Russia

Daniil Chernyshov

Lomonosov Moscow State University

Email: chdanorbis@yandex.ru
ORCID ID: 0009-0001-6847-2122
Moscow, Russia



This article is available under the Creative Commons Attribution 4.0 International License.