The use of topic modeling to optimize the process of searching for relevant historical documents (on the example of the stock exchange press of the early 20th century)
- Authors: Galushko I.N.
- Issue: No 2 (2023)
- Pages: 129-144
- Section: Articles
- URL: https://ogarev-online.ru/2585-7797/article/view/367045
- DOI: https://doi.org/10.7256/2585-7797.2023.2.43466
- EDN: https://elibrary.ru/SKBPNS
- ID: 367045
Abstract
The key task of this article is to test how topic modeling can be used to assess the information potential of a collection of historical sources. Some modern collections of digitized historical materials number tens of thousands of documents, and an individual researcher can hardly survey everything that is available. Following a number of researchers, we suggest that topic modeling can serve as a convenient tool for a preliminary assessment of the content of a collection of historical documents and for selecting only those documents that contain information relevant to the research tasks. In our case, the Birzhevye Vedomosti newspaper was chosen as one of the main collections of historical documents. At this stage, we can confirm that in our study topic modeling proved to be a productive way to optimize the search for historical documents in a large collection of digitized historical materials. At the same time, it should be emphasized that in our work topic modeling was used exclusively as an applied tool for a primary assessment of the information potential of a document collection through the analysis of the extracted topics. Our experience has shown that, at least for Birzhevye Vedomosti, topic modeling with LDA does not allow us to draw conclusions from the standpoint of our content analysis methodology. The data produced by our models are too fragmentary and can only be used for an initial assessment of the topics that describe the information contained in the source.
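The workflow described in the abstract (fit an LDA model, inspect the topics, then keep only the documents dominated by a relevant topic) can be illustrated with a minimal sketch. This is not the author's pipeline: the toy corpus, the function names `fit_lda`, `top_words` and `select_documents`, the hyperparameters, and the 0.5 threshold are all assumptions made for illustration, and the sampler is a plain collapsed Gibbs implementation rather than any particular library.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (theta, phi, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [[0] * V for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                      # total tokens per topic
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    # Smoothed document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(num_topics)
    ]
    return theta, phi, vocab

def top_words(phi, vocab, topic, n=5):
    """Most probable words of a topic, used to judge what the topic is about."""
    order = sorted(range(len(vocab)), key=lambda v: phi[topic][v], reverse=True)
    return [vocab[v] for v in order[:n]]

def select_documents(theta, topic, threshold=0.5):
    """Indices of documents whose weight for `topic` meets the threshold, best first."""
    ranked = sorted(range(len(theta)), key=lambda d: theta[d][topic], reverse=True)
    return [d for d in ranked if theta[d][topic] >= threshold]

# Hypothetical, already tokenized toy corpus standing in for newspaper issues.
docs = [
    "shares exchange dividend bank shares".split(),
    "bank shares exchange quotation dividend".split(),
    "theatre premiere opera actor theatre".split(),
    "opera actor premiere theatre concert".split(),
    "exchange quotation bank dividend shares".split(),
]

theta, phi, vocab = fit_lda(docs, num_topics=2)
# Treat the topic that gives "shares" the highest probability as the relevant one.
stock_topic = max(range(2), key=lambda k: phi[k][vocab.index("shares")])
relevant = select_documents(theta, stock_topic, threshold=0.5)
print(top_words(phi, vocab, stock_topic), relevant)
```

In this sketch the model only narrows down which documents to read. The topic weights are not themselves interpreted, which matches the abstract's caveat that the model output is suitable for an initial assessment only.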
References
- URL: http://docs.historyrussia.org/ru/nodes/1-glavnaya
- Tze-I Yang, A.J. Torget, R. Mihalcea (2011). Topic modeling in historical newspapers.
- Marjanen, J., Zosa, E., Hengchen, S., Pivovarova, L., & Tolonen, M. (2020). Topic Modelling Discourse Dynamics in Historical Newspapers. DHN Post-Proceedings.
- Koentges, Thomas (2020). Measuring Philosophy in the First Thousand Years of Greek Literature.
- Egger, Roman (2020). A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts.
- Galushko I.N. Correcting the OCR results for the text of a historical source using fuzzy sets (on the example of an early 20th-century newspaper) // Istoricheskaya informatika. 2023. No. 1. Pp. 102-113.
- The presented article is part of my master's thesis on the topic "Behavioral aspects of the analysis of securities returns on the stock market of the Russian Empire in the early 20th century: content analysis of stock exchange narratives".
- In this work, the issues of Birzhevye Vedomosti found by the LDA algorithm were considered in combination with the materials of fond No. 143 of TsGAM (Moscow Exchange Committee) and the works of stock exchange practitioners of the early 20th century (Vasiliev A.A. Stock exchange speculation, theory and practice. St. Petersburg, 1912).
- Vorontsov K.V. Probabilistic topic modeling: theory, models, algorithms, and the BigARTM project. 2020.
- GitHub. URL: https://github.com/iodinesky/Topic-modeling-in-historical-newspapers
- Vorontsov K.V. Probabilistic topic modeling: the ARTM regularization theory and the open-source library BigARTM. 2023.

