Progress in Natural Language Processing Technologies: Regulating Quality and Accessibility of Training Data

I. Ilyin

doi:10.17323/2713-2749.2024.2.36.56

Authors: Ilyin I.¹
Affiliations:
1. Saint Petersburg State University
Issue: Vol 5, No 2 (2024)
Pages: 36-56
Section: Artificial Intelligence and law
URL: https://ogarev-online.ru/2713-2749/article/view/294027
DOI: https://doi.org/10.17323/2713-2749.2024.2.36.56
ID: 294027

Abstract

Progress in natural language processing (NLP) technologies is a key driver of innovative digital products and carries major socioeconomic importance. However, inadequate legal regulation of the quality and accessibility of training data is a major obstacle to this technological development. The paper focuses on regulatory issues affecting the quality and accessibility of the data needed to train language models. Analyzing the normative barriers and proposing ways to remove them, the author argues for a comprehensive regulatory system designed to ensure sustainable development of the technology.

Keywords

personal data, data regime, generative neural network, artificial intelligence, natural language processing, large language models, data access, copyright

About the authors

I. Ilyin

Saint Petersburg State University

Corresponding author.
Email: i.g.ilin@spbu.ru
