Research Projects

The School of Linguistics is currently carrying out a large multidisciplinary project, "Linguistic technologies in the era of digital revolution", which comprises a number of sub-projects. The project leaders are Olga N. Lyashevskaya (h-index: 3) and Anastasiya A. Bonch-Osmolovskaya (h-index: 2).

The purpose of the project is to develop and apply linguistic technologies for collecting and electronically publishing data on the history of the Russian language and on modern Russian grammar and lexicography, and for preserving the Russian literary heritage. The project focuses on implementing European standards and practices in data curation and preservation. It is expected to produce the following:

  1. an electronic dictionary of the Russian language in a standardized computer-readable format that integrates all existing lexicographic data on the contemporary and diachronic states of the Russian language and on the dynamics of its development;
  2. an integrated Russian grammar database that takes contemporary internet communication into account;
  3. an electronic publication of the Russian literary heritage with word sense tagging, integrated alternative versions and a critical apparatus (an electronic “Russian Classics Guide”);
  4. online courses on the digital representation of literary heritage;
  5. the integration of data on Russian lexis and grammar and on the Russian literary heritage into global data storage systems (Linked Open Data).

The relevant parts of the project are described below.

1) The project to digitize Russian classical texts began two years ago and is headed by Anastasiya Bonch-Osmolovskaya and Boris Orekhov. In 2014, a unique crowdsourcing project, “All Tolstoy in one click”, was launched by the Tolstoy Museum in Moscow and ABBYY, one of Russia’s top IT companies and a leader in optical character recognition. The 90-volume edition was digitized with ABBYY OCR technology and then proofread in two weeks by thousands of volunteers from 49 countries. The works can now be downloaded free of charge in popular e-book formats. The project’s objectives are:

  1. to annotate all kinds of relevant data in Tolstoy’s works, using the TEI framework and word sense tagging;
  2. to create a complete database of all named entities mentioned in the texts or commentaries;
  3. to link variants and drafts;
  4. to publish the results on the web, providing extensive search and visualization tools.
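To make the first two objectives more concrete, here is a minimal sketch of what TEI-style annotation of a sentence might look like. It is not the project's actual encoding scheme: the token dictionaries, the identifiers `pers_levin` and `sense_read_1`, and the helper names `tei` and `annotate_sentence` are hypothetical. Only the TEI namespace and the element names `s`, `w` and `persName` come from the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def annotate_sentence(tokens):
    """Build a TEI <s> element from token dicts (hypothetical input format).

    Each token has 'text' and may have 'lemma', 'sense' (a word sense id)
    and 'person' (an id in a named-entity database).
    """
    sentence = ET.Element(tei("s"))
    for tok in tokens:
        word = ET.Element(tei("w"))
        word.text = tok["text"]
        if "lemma" in tok:
            word.set("lemma", tok["lemma"])
        if "sense" in tok:
            word.set("ana", "#" + tok["sense"])  # pointer to a sense entry
        if "person" in tok:
            # Wrap the word in <persName> pointing at the entity record.
            name = ET.SubElement(sentence, tei("persName"),
                                 {"ref": "#" + tok["person"]})
            name.append(word)
        else:
            sentence.append(word)
    return sentence


example = annotate_sentence([
    {"text": "Levin", "lemma": "Levin", "person": "pers_levin"},
    {"text": "read", "lemma": "read", "sense": "sense_read_1"},
])
xml_text = ET.tostring(example, encoding="unicode")
```

Keeping entity references as pointers (`ref="#..."`) rather than inline descriptions means the same database record can be linked from the text and from the commentaries.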

The first task has now been completed. The next stage of the project will be devoted to developing a biographical database. It has been demonstrated elsewhere that a large portion of biographical facts can be obtained automatically (Bonch-Osmolovskaya, Kolbasov 2015). Further plans include linking the database items to external web sites, such as Wikipedia, DBpedia and relevant Linked Open Data archives.
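One common way to express links between local database items and external Linked Open Data resources is a set of `owl:sameAs` statements. The sketch below emits such statements in N-Triples syntax. It is only an illustration: the base URI `http://example.org/tolstoy/person/`, the local identifier and the function name `same_as_triples` are assumptions, not part of the project.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Hypothetical base URI for records in the biographical database.
LOCAL_BASE = "http://example.org/tolstoy/person/"


def same_as_triples(links):
    """Turn {local_id: external_uri} into sorted N-Triples lines."""
    lines = []
    for local_id, external_uri in sorted(links.items()):
        lines.append(
            f"<{LOCAL_BASE}{local_id}> <{OWL_SAME_AS}> <{external_uri}> ."
        )
    return lines


triples = same_as_triples({
    "leo_tolstoy": "http://dbpedia.org/resource/Leo_Tolstoy",
})
```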

The research team is also developing a series of online courses on the methods, problems and technical approaches involved in digitizing literary heritage.

2) Another sub-project focuses on the development of electronic dictionaries (the main participant is Olga Lyashevskaya). Building on the success of the “Frequency dictionary of modern Russian”, the group is creating new corpus-based dictionaries. One example is the frequency dictionary of inflectional paradigms (see a description at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2286096##).

The new corpus-based dictionaries are intended as an alternative to traditional dictionaries, which are not always linked to real language use and do not indicate how frequent a unit is. The frequency dictionary of inflectional paradigms, for instance, departs from traditional frequency lexicography in several ways:

  1. word forms are arranged in paradigms, so their frequencies can be compared and ranked;
  2. the dictionary is focused on the grammatical profiles of individual lexemes, rather than on the overall distribution of grammatical features (e.g., the fact that Future forms are used less frequently than Past forms);
  3. the grammatical profiles of lexical units can be compared against the mean scores of their lexicosemantic class;
  4. in each part of speech or semantic class, lexemes with certain biases in the grammatical profile can be easily detected (e.g., verbs used mostly in the Imperative or the Past neuter, or nouns often used in the plural); and,
  5. the distribution of homonymous word forms and grammatical variants can be followed over time and within certain genres and registers.
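The idea of a grammatical profile, and of comparing it against a class mean, can be sketched as follows. This is an illustration under stated assumptions, not the dictionary's actual method: the toy token list, the transliterated lemmas, the `margin` threshold and all function names are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (lemma, grammatical tag) pairs for nouns.
TOKENS = [
    ("stol", "sg"), ("stol", "sg"), ("stol", "pl"),
    ("nozhnicy", "pl"), ("nozhnicy", "pl"),
    ("dom", "sg"), ("dom", "sg"), ("dom", "sg"), ("dom", "pl"),
]


def grammatical_profiles(tokens):
    """Relative frequency of each grammatical form within a lexeme's paradigm."""
    counts = defaultdict(Counter)
    for lemma, tag in tokens:
        counts[lemma][tag] += 1
    profiles = {}
    for lemma, tag_counts in counts.items():
        total = sum(tag_counts.values())
        profiles[lemma] = {tag: n / total for tag, n in tag_counts.items()}
    return profiles


def class_mean(profiles, tag):
    """Mean share of `tag` across all lexemes of the class."""
    return sum(p.get(tag, 0.0) for p in profiles.values()) / len(profiles)


def biased_lexemes(profiles, tag, margin=0.3):
    """Lexemes whose share of `tag` exceeds the class mean by at least `margin`."""
    mean = class_mean(profiles, tag)
    return sorted(
        lemma for lemma, p in profiles.items()
        if p.get(tag, 0.0) - mean >= margin
    )


profiles = grammatical_profiles(TOKENS)
plural_biased = biased_lexemes(profiles, "pl")
```

In this toy data, only `nozhnicy` ("scissors") has a plural share that exceeds the class mean by the margin, so `plural_biased` contains just that lexeme. A real dictionary would need corpus-scale counts and a statistically motivated threshold rather than a fixed margin.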

The dictionary sub-project also has a long-term aim. The participants seek to develop a standardized computer-readable format that allows a user to bring together all existing lexicographic data on the contemporary and diachronic states of the Russian language.

3) The corpus-oriented Russian grammar project is being developed by scholars from the School of Linguistics of the Faculty of Humanities (Valentina Apresyan, Michael Daniel, Nina Dobrushina, Alexander Letuchiy, Olga Lyashevskaya, Ekaterina Rakhilina) together with specialists from the Russian Language Institute (Russian Academy of Sciences) and other institutions. So far, more than thirty chapters on various grammatical phenomena (imperative, conjunctions, reflexive, finiteness, and so on) have been written. The grammar includes examples of real language use, taken from the Russian National Corpus (http://www.ruscorpora.ru/en/index.html). The description of the language data is oriented toward modern formal and functional approaches to linguistics.

Although Russian has a well-developed grammatical tradition, the existing academic grammars often lack a connection with modern data. This is why the Rusgram project is important for the description of Russian.

Alongside the Rusgram project, which concentrates primarily on the literary language, a new sub-project focused on electronic communication is being developed.