Wrong Answers Only: Distractor Generation for Russian Reading Comprehension Questions Using a Translated Dataset
- Authors: Login N.V.
- Affiliations:
- HSE University
- Issue: Volume 10, No. 4 (2024)
- Pages: 56-70
- Section: Research Papers
- URL: https://ogarev-online.ru/2411-7390/article/view/356609
- DOI: https://doi.org/10.17323/jle.2024.22244
- ID: 356609
Abstract
Background: Reading comprehension questions play an important role in language learning. Multiple-choice questions are a convenient form of reading comprehension assessment, as they can easily be graded automatically. The availability of large reading comprehension datasets also makes it possible to produce these items automatically by fine-tuning language models on them, reducing the cost of developing test question banks. While English reading comprehension datasets are common, this is not true for other languages, including Russian. The subtask of distractor generation poses a particular difficulty, as it requires producing multiple incorrect options.
Purpose: The purpose of this work is to develop an efficient distractor generation solution for Russian exam-style reading comprehension questions and to discover whether a translated English-language distractor dataset can enable such a solution.
Method: In this paper we fine-tuned two pre-trained Russian large language models, RuT5 and RuGPT3 (Zmitrovich et al., 2024), on the distractor generation task for two classes of summarizing questions retrieved from a large multiple-choice question dataset that was automatically translated from English into Russian. The first class consisted of questions on selecting the best title for a given passage, while the second class included questions on true/false statement selection. The models were assessed automatically on the test and development subsets, and the true statement distractor models were additionally evaluated on an independent set of questions from the Russian state exam USE (Unified State Exam).
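As a rough illustration of the fine-tuning setup described above, the sketch below converts one multiple-choice item into seq2seq (input, target) training pairs, one pair per distractor. The field names, separator token, and sample strings are hypothetical assumptions for illustration, not the paper's actual data format:

```python
# Hypothetical sketch: turning a translated multiple-choice item into
# seq2seq training pairs for distractor generation. Field names and the
# "<sep>" token are illustrative assumptions, not the paper's format.

def make_training_pairs(item, sep=" <sep> "):
    """Build (input, target) pairs: the input concatenates passage,
    question, and correct answer; each target is one distractor
    (an incorrect option)."""
    source = item["passage"] + sep + item["question"] + sep + item["answer"]
    return [(source, option)
            for option in item["options"]
            if option != item["answer"]]

item = {
    "passage": "Текст для чтения ...",
    "question": "Какое утверждение верно?",
    "answer": "Верное утверждение",
    "options": ["Верное утверждение", "Дистрактор 1", "Дистрактор 2"],
}

# Two pairs, one per distractor, sharing the same source sequence
pairs = make_training_pairs(item)
```

Pairs of this shape can then be fed to a text-to-text model such as RuT5 for fine-tuning; a decoder-only model such as RuGPT3 would instead see the source and target concatenated into a single sequence.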
Results: The models surpassed the non-fine-tuned baseline, the RuT5 model performed better than RuGPT3, and both models handled true statement selection questions much better than title questions. On USE data, the models fine-tuned on the translated dataset showed better quality than the one trained on an existing Russian distractor dataset, with the T5-based model also beating the baseline established by the output of an existing English distractor generation model translated into Russian.
Conclusion: The obtained results show that a translated dataset can be used for distractor generation and highlight the importance of matching the domain (language examination) and question type in the input data.
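The automatic assessment mentioned above relies on overlap metrics such as BLEU (Papineni et al., 2002). The sketch below shows the core idea, clipped n-gram precision, in a simplified unigram form; it is an illustration of the metric family, not the paper's exact evaluation code:

```python
# Simplified unigram precision, the idea behind BLEU-style automatic
# metrics used to compare generated distractors against references.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference,
    with counts clipped as in BLEU's modified precision."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

# 5 of the 6 candidate tokens are covered by the reference
# (the second "the" is clipped), giving 5/6
score = unigram_precision(
    "the best title for the passage",
    "the best possible title for this passage",
)
```

Full BLEU additionally combines precisions over higher-order n-grams and applies a brevity penalty; ROUGE and METEOR (also cited below) measure recall- and alignment-based overlap in a similar spirit.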
About the authors
Nikita Login
HSE University
Email: nlogin@hse.ru
ORCID ID: 0009-0007-2480-8708
Moscow, Russia
References
- Alsubait, T. M. (2015). Ontology-based multiple-choice question generation [Unpublished PhD thesis]. University of Manchester.
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgements. In J. Goldstein, A. Lavie, C.-Y. Lin, & C. Voss (Eds.), Proceedings of the ACL Workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72). Association for Computational Linguistics.
- Belyanova, M. A., Andreev, A. M., & Gapanyuk, Y. E. (2022). Neural text question generation for Russian language using hybrid intelligent information systems approach. In B. Kryzhanovsky, W. Dunin-Barkowski, V. Redko, Y. Tiumentsev, & V. V. Klimov (Eds.), Advances in neural computation, machine learning, and cognitive research V (vol. 1008, pp. 217-223). Springer International Publishing. DOI:https://doi.org/10.1007/978-3-030-91581-0_29
- Bitew, S. K., Hadifar, A., Sterckx, L., Deleu, J., Develder, C., & Demeester, T. (2022). Learning to reuse distractors to support multiple choice question generation in education. IEEE Transactions on Learning Technologies, 17, 375-390. IEEE Computer Society Press. DOI:https://doi.org/10.1109/TLT.2022.3226523
- Bitew, S. K., Deleu, J., Develder, C., & Demeester, T. (2023). Distractor generation for multiple-choice questions with predictive prompting and large language models (Version 1). arXiv. DOI:https://doi.org/10.48550/arXiv.2307.16338
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (vol. 33, pp. 1877-1901). Curran Associates, Inc. DOI:https://doi.org/10.48550/arXiv.2005.14165
- Chung, H.-L., Chan, Y.-H., & Fan, Y.-C. (2020). A BERT-based distractor generation scheme with multi-tasking and negative answer training strategies. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4390-4400). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/2020.findings-emnlp.393
- De-Fitero-Dominguez, D., Garcia-Lopez, E., Garcia-Cabot, A., Del-Hoyo-Gabaldon, J.-A., & Moreno-Cediel, A. (2024). Distractor generation through text-to-text transformer models. IEEE Access, 12, 25580-25589. DOI:https://doi.org/10.1109/ACCESS.2024.3361673
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (vol. 1: Long and Short Paper, pp. 4171-4186). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/N19-1423
- Efimov, P., Chertok, A., Boytsov, L., & Braslavski, P. (2020). SberQuAD - Russian reading comprehension dataset: Description and analysis. In A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, & N. Ferro (Eds.), Experimental IR meets multilinguality, multimodality, and interaction (vol. 12260, pp. 3-15). Springer International Publishing. DOI:https://doi.org/10.1007/978-3-030-58219-7_1
- Elkins, S., Kochmar, E., Serban, I., & Cheung, J. C. K. (2023). How useful are educational questions generated by large language models? In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial intelligence in education. Posters and late breaking results, workshops and tutorials, industry and innovation tracks, practitioners, doctoral consortium and blue sky (vol. 1831, pp. 536-542). Springer Nature Switzerland. DOI:https://doi.org/10.1007/978-3-031-36336-8_83
- Fenogenova, A., Mikhailov, V., & Shevelev, D. (2020). Read and reason with MuSeRC and RuCoS: Datasets for machine reading comprehension for Russian. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 6481-6497). International Committee on Computational Linguistics. DOI:https://doi.org/10.18653/v1/2020.coling-main.570
- Gao, Y., Bing, L., Li, P., King, I., & Lyu, M. R. (2019). Generating distractors for reading comprehension questions from real examinations. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 6423-6430. DOI:https://doi.org/10.1609/aaai.v33i01.33016423
- Ghanem, B., & Fyshe, A. (2024). DISTO: Textual distractors for multiple choice reading comprehension questions using negative sampling. In M. Marras & M. Ueno (Eds.), Proceedings of the 17th International Conference on Educational Data Mining (pp. 23-34). International Educational Data Mining Society. DOI:https://doi.org/10.5281/ZENODO.12729766
- Glushkova, T., Machnev, A., Fenogenova, A., Shavrina, T., Artemova, E., & Ignatov, D. I. (2021). DaNetQA: A yes/no question answering dataset for the Russian language. In W. M. P. Van Der Aalst, V. Batagelj, D. I. Ignatov, M. Khachay, O. Koltsova, A. Kutuzov, S. O. Kuznetsov, I. A. Lomazova, N. Loukachevitch, A. Napoli, A. Panchenko, P. M. Pardalos, M. Pelillo, A. V. Savchenko, & E. Tutubalina (Eds.), Analysis of Images, Social Networks and Texts (vol. 12602, pp. 57-68). Springer International Publishing. DOI:https://doi.org/10.1007/978-3-030-72610-2_4
- Hadifar, A., Bitew, S. K., Deleu, J., Develder, C., & Demeester, T. (2023). EduQG: A multi-format multiple-choice dataset for the educational domain. IEEE Access, 11, 20885-20896. DOI:https://doi.org/10.1109/ACCESS.2023.3248790
- Huang, L., Le Bras, R., Bhagavatula, C., & Choi, Y. (2019). CosmosQA: Machine reading comprehension with contextual commonsense reasoning. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 2391-2401). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/D19-1243
- Joshi, M., Choi, E., Weld, D., & Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay & M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 1601-1611). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/P17-1147
- Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30(1), 121-204. DOI:https://doi.org/10.1007/s40593-019-00186-y
- Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., & Petrov, S. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453-466. DOI:https://doi.org/10.1162/tacl_a_00276
- Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). RACE: Large-scale reading comprehension dataset from examinations. In M. Palmer, R. Hwa, & S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 785-794). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/D17-1082
- Lee, D. B., Lee, S., Jeong, W. T., Kim, D., & Hwang, S. J. (2020). Generating diverse and consistent QA pairs from contexts with information-maximizing hierarchical conditional VAEs. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 208-224). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/2020.acl-main.20
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7871-7880). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/2020.acl-main.703
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74-81). Association for Computational Linguistics. https://aclanthology.org/W04-1013.
- Lu, X., West, P., Zellers, R., Bras, R. L., Bhagavatula, C., & Choi, Y. (2021). NeuroLogic decoding: (Un)supervised neural text generation with predicate logic constraints. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (pp. 4288-4299). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/2021.naacl-main.339
- Maity, S., Deroy, A., & Sarkar, S. (2024). A novel multi-stage prompting approach for language agnostic MCQ generation using GPT. In N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, & I. Ounis (Eds.), Advances in information retrieval (vol. 14610, pp. 268-277). Springer Nature Switzerland. DOI:https://doi.org/10.1007/978-3-031-56063-7_18
- Makhnytkina, O., Matveev, A., Svischev, A., Korobova, P., Zubok, D., Mamaev, N., & Tchirkovskii, A. (2020). Conversational question generation in Russian. In S. Balandin, L. Turchet, & T. Tyutina (Eds.), 2020 27th Conference of Open Innovations Association (FRUCT) (pp. 1-8). IEEE. DOI:https://doi.org/10.23919/FRUCT49677.2020.9211056
- Manakul, P., Liusie, A., & Gales, M. (2023). MQAG: Multiple-choice question answering and generation for assessing information consistency in summarization. In J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, & A. A. Krisnadhi (Eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific chapter of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 39-53). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/2023.ijcnlp-main.4
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02) (pp. 311-318). Association for Computational Linguistics. DOI:https://doi.org/10.3115/1073083.1073135
- Paris, A. H., & Paris, S. G. (2003). Assessing narrative comprehension in young children. Reading Research Quarterly, 38(1), 36-76. DOI:https://doi.org/10.1598/RRQ.38.1.3
- Qiu, Z., Wu, X., & Fan, W. (2020). Automatic distractor generation for multiple choice questions in standard tests. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 2096-2106). International Committee on Computational Linguistics. DOI:https://doi.org/10.18653/v1/2020.coling-main.189
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485-5551. DOI:https://doi.org/10.5555/3455716.3455856
- Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383-2392). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/D16-1264
- Reddy, S., Chen, D., & Manning, C. D. (2019). CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 249-266. DOI:https://doi.org/10.1162/tacl_a_00266
- Rybin, I., Korablinov, V., Efimov, P., & Braslavski, P. (2021). RuBQ 2.0: An innovated Russian question answering dataset. In R. Verborgh, K. Hose, H. Paulheim, P.-A. Champin, M. Maleshkova, O. Corcho, P. Ristoski, & M. Alam (Eds.), The Semantic Web (vol. 12731, pp. 532-547). Springer International Publishing. DOI:https://doi.org/10.1007/978-3-030-77385-4_32
- Sekulić, I., Aliannejadi, M., & Crestani, F. (2021). Towards facet-driven generation of clarifying questions for conversational search. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval (pp. 167-175). Association for Computing Machinery. DOI:https://doi.org/10.1145/3471158.3472257
- Shavrina, T., Emelyanov, A., Fenogenova, A., Fomin, V., Mikhailov, V., Evlampiev, A., Malykh, V., Larin, V., Natekin, A., Vatulin, A., Romov, P., Anastasiev, D., Zinov, N., & Chertok, A. (2020, May). Humans keep it one hundred: An overview of AI Journey. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 2276-2284). European Language Resources Association. https://aclanthology.org/2020.lrec-1.277.
- Tiedemann, J., & Thottingal, S. (2020). OPUS-MT - Building open translation services for the world. In A. Martins, H. Moniz, S. Fumega, B. Martins, F. Batista, L. Coheur, C. Parra, I. Trancoso, M. Turchi, A. Bisazza, J. Moorkens, A. Guerberof, M. Nurminen, L. Marg, & M. L. Forcada (Eds.), Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (pp. 479-480). European Association for Machine Translation. https://aclanthology.org/2020.eamt-1.61.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (vol. 30, pp. 6000-6010). Curran Associates, Inc. DOI:https://doi.org/10.5555/3295222.3295349
- Welbl, J., Liu, N. F., & Gardner, M. (2017). Crowdsourcing multiple choice science questions. In L. Derczynski, W. Xu, A. Ritter, & T. Baldwin (Eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text (pp. 94-106). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/W17-4413
- Xiao, D., Zhang, H., Li, Y., Sun, Y., Tian, H., Wu, H., & Wang, H. (2020). ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In C. Bessiere (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 3997-4003). International Joint Conferences on Artificial Intelligence Organization. DOI:https://doi.org/10.24963/ijcai.2020/553
- Xu, Y., Wang, D., Yu, M., Ritchie, D., Yao, B., Wu, T., Zhang, Z., Li, T., Bradford, N., Sun, B., Hoang, T., Sang, Y., Hou, Y., Ma, X., Yang, D., Peng, N., Yu, Z., & Warschauer, M. (2022). Fantastic questions and where to find them: FairytaleQA - An authentic dataset for narrative comprehension. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers, pp. 447-460). Association for Computational Linguistics. DOI:https://doi.org/10.18653/v1/2022.acl-long.34
- Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., & Raffel, C. (2020). mT5: A massively multilingual pre-trained text-to-text transformer (Version 3). arXiv. DOI:https://doi.org/10.48550/arXiv.2010.11934
- Zhang, C. (2023). Automatic generation of multiple-choice questions (Version 1). arXiv. DOI:https://doi.org/10.48550/ARXIV.2303.14576
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT (Version 3). arXiv. DOI:https://doi.org/10.48550/ARXIV.1904.09675
- Zmitrovich, D., Abramov, A., Kalmykov, A., Tikhonova, M., Taktasheva, E., Astafurov, D., Baushenko, M., Snegirev, A., Kadulin, V., Markov, S., Shavrina, T., Mikhailov, V., & Fenogenova, A. (2024). A family of pretrained transformer language models for Russian. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 507-524). ELRA Language Resource Association. DOI:https://doi.org/10.48550/arXiv.2309.10931