The School of Linguistics was founded in December 2014. Today, the School offers undergraduate and graduate programs in theoretical and computational linguistics. Linguistics as it is taught and researched at the School does not simply involve mastering foreign languages. Rather, it is the science of language and the methods of its modeling. Research groups in the School of Linguistics study typology, sociolinguistics and areal linguistics, corpus linguistics and lexicography, ancient languages and the history of languages. The School is also developing linguistic technologies and electronic resources: corpora, training simulators, dictionaries, thesauruses, and tools for digital storage and processing of written texts.
The School of Linguistics is currently carrying out a large multidisciplinary project, "Linguistic technologies in the era of digital revolution", which comprises a number of sub-projects. The project leaders are Olga N. Lyashevskaya (h-index 3) and Anastasiya A. Bonch-Osmolovskaya (h-index 2).
The purpose of the project is to develop and apply linguistic technologies for collecting and electronically publishing data on the history of the Russian language, modern Russian grammar and lexicography, and for preserving the Russian literary heritage. The project focuses on implementing European standards and practices in data curation and preservation. The project is expected to produce the following: an electronic dictionary of the Russian language in a standardized computer-readable format that integrates all existing lexicographic data on the contemporary and diachronic states of the Russian language and on the dynamics of its development; an integrated Russian grammar database that takes contemporary internet communication into account; an electronic publication of the Russian literary heritage with word sense tagging, integrated alternative versions and a critical apparatus (an electronic “Russian Classics Guide”); online courses on the digital representation of literary heritage; and the integration of data on Russian lexis and grammar and on the Russian literary heritage into global data storage systems (Linked Open Data).
The relevant parts of the project are described below.
1) The project to digitize Russian classical texts began two years ago and is headed by Anastasia Bonch-Osmolovskaya and Boris Orekhov. In 2014, a unique crowdsourcing project, “All Tolstoy in one click”, was launched by the Tolstoy Museum in Moscow and ABBYY, one of Russia’s top IT companies and a leader in optical character recognition. The 90-volume edition was digitized with the help of ABBYY OCR technology and then proofread by thousands of volunteers from 49 countries in two weeks. The works can now be downloaded free of charge in popular e-book formats. The project’s objectives may be stated in the following way:
The first task has now been completed. The next stage of the project will be devoted to developing a biographical database. It has been demonstrated elsewhere that a large portion of biographical facts can be obtained automatically (Bonch-Osmolovskaya, Kolbasov 2015). Further plans include linking the database items to external resources such as Wikipedia, DBpedia and relevant Linked Open Data archives.
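As a rough illustration of what linking database items to Linked Open Data resources can look like, the sketch below emits `owl:sameAs` statements in N-Triples syntax. It is not the project's implementation: the `example.org` URI and the helper name `same_as_triple` are placeholders invented for this example, and only the DBpedia resource URI follows a real naming pattern.

```python
# Predicate commonly used to state that two URIs denote the same entity.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(local_uri: str, external_uri: str) -> str:
    """Return one N-Triples line linking a local record to an external one."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{external_uri}> ."


# Hypothetical mapping from local database items to external resources.
# The example.org URI is a placeholder, not a real project identifier.
links = {
    "http://example.org/person/leo-tolstoy": "http://dbpedia.org/resource/Leo_Tolstoy",
}

triples = [same_as_triple(local, external) for local, external in links.items()]

for line in triples:
    print(line)
```

Statements of this kind let an external consumer follow the link from a local biographical record to the matching entry in a public dataset.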
The research team is also working on a series of online courses on the methods, problems and technical approaches involved in digitizing literary heritage.
2) Another sub-project focuses on the development of electronic dictionaries (the main participant is Olga Lyashevskaya). Building on the success of the “Frequency dictionary of modern Russian”, the group is creating new corpus-based dictionaries. One example is the frequency dictionary of inflectional paradigms (see a description at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2286096##).
The new corpus-based dictionaries are intended as an alternative to traditional dictionaries, which are not always linked to real language use and do not indicate how frequent a unit is. The frequency dictionary of inflectional paradigms, for instance, departs from traditional frequency lexicography in several ways:
The dictionary sub-project also has a long-term aim. The participants seek to develop a standardized computer-readable format that allows a user to bring together all existing lexicographic data on the contemporary and diachronic states of the Russian language.
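To give a concrete sense of what a frequency dictionary of inflectional paradigms might record, here is a minimal sketch that computes, for each lemma, the relative frequency of its grammatical forms. It assumes a corpus that has already been lemmatized and morphologically tagged; the function name `paradigm_frequencies`, the tag labels and the toy sample are invented for illustration and do not reproduce the project's data or method.

```python
from collections import Counter, defaultdict


def paradigm_frequencies(tagged_tokens):
    """Map each lemma to the relative frequency of each of its forms.

    tagged_tokens is an iterable of (lemma, form_tag) pairs, for example
    ("дом", "nom.sg"). The tag inventory here is hypothetical.
    """
    counts = defaultdict(Counter)
    for lemma, form_tag in tagged_tokens:
        counts[lemma][form_tag] += 1

    profiles = {}
    for lemma, tag_counts in counts.items():
        total = sum(tag_counts.values())
        profiles[lemma] = {tag: n / total for tag, n in tag_counts.items()}
    return profiles


# Toy, invented sample standing in for a tagged corpus.
sample = [
    ("дом", "nom.sg"),
    ("дом", "nom.sg"),
    ("дом", "loc.sg"),
    ("дом", "gen.pl"),
    ("рука", "ins.pl"),
]

profiles = paradigm_frequencies(sample)
print(profiles["дом"])
```

A profile of this kind shows which cells of a paradigm a word actually favors in use, which is information a traditional dictionary entry does not carry.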
3) The corpus-oriented Russian grammar is being developed by scholars from the School of Linguistics of the Faculty of Humanities (Valentina Apresyan, Michael Daniel, Nina Dobrushina, Alexander Letuchiy, Olga Lyashevskaya, Ekaterina Rakhilina) together with specialists from the Russian Language Institute (Russian Academy of Sciences) and other institutions. So far, more than thirty chapters on various grammatical phenomena (imperative, conjunctions, reflexive, finiteness, and so on) have been written. The grammar includes examples of real language use taken from the Russian National Corpus (http://www.ruscorpora.ru/en/index.html). The presentation of the language data is oriented toward modern formal and functional approaches to linguistics.
Although Russian has a well-developed grammatical tradition, the existing academic grammars often lack a connection with modern data. This is why the Rusgram project, the corpus-oriented grammar described above, is important for the description of Russian.
Alongside the Rusgram project, which is primarily concentrated on literary language, a new sub-project, focused on electronic communication, is being developed.