Ekaterina V. Rakhilina
Poetic texts pose a challenge to full morphological tagging and lemmatization, since authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, and use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic taggers based on decision trees, CRF and neural network algorithms, as well as a state-of-the-art dictionary-based tagger. The taggers were trained on prose texts and tested on three poetic samples of differing complexity. Firstly, we suggest a method for compiling gold standard datasets for Russian poetry. Secondly, we focus on the taggers’ performance in identifying part-of-speech tags and lemmas. We show which POS classes, paradigm classes and syntactic patterns most affect the quality of processing.
This paper discusses novel facts regarding adpositional agreement in Avar in light of recent theories of feature valuation. I show that the traditional notion of downward Agree/upward valuation is sufficient to account for the observed facts, rendering the competing mechanism of upward Agree/downward valuation superfluous.
The paper examines the properties of heavy as a perceptual concept, based on evidence from 11 languages. We demonstrate that the semantics of this concept is heterogeneous; lexemes of this field can be used in situations of at least three types: Lifting, Shifting and Weighing. These situations are either lexicalised as separate words or they converge in a single lexeme in various combinations following certain strategies. We also argue that different metaphorical extensions correspond to different situation types; this allows us to use analysis of metaphoric shifts as an additional instrument to establish the semantic structure of direct meanings.
The paper discusses efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of morphological annotation (the RNC schema and the Universal Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking and conversion.
The paper reports a method to create a speaker’s prosodic fingerprint based on the global characteristics of pitch movement. A prosodic fingerprint is the distribution of f0 in the low, middle, and high ranges and the distribution of pitch movements from one range into another [Šimko et al. 2017]. This fully automated method can be used to classify recordings and to provide a reference level for more sophisticated analysis of pitch movement and intonation strategies. We evaluate the method by applying it to spontaneous Russian spoken data recorded in different regions. We model the correlation between the fingerprint and sociolinguistic features such as age, gender, and region. The results of this analysis allow us to formulate several sociolinguistic hypotheses that can be further tested with a more detailed analytic technique.
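The definition of the fingerprint given in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prosodic_fingerprint`, the fixed cut-off parameters `low_cut` and `high_cut` (in Hz), and the choice to count a movement whenever two consecutive f0 frames fall into different ranges are all assumptions made for illustration.

```python
from collections import Counter


def prosodic_fingerprint(f0_values, low_cut, high_cut):
    """Return (range_distribution, movement_distribution) for one speaker.

    f0_values: voiced-frame f0 measurements in Hz, in temporal order.
    low_cut, high_cut: hypothetical boundaries between the low/mid/high
    ranges; the abstract does not say how the ranges are delimited.
    """
    if not f0_values:
        raise ValueError("f0_values must not be empty")

    def band(f0):
        if f0 < low_cut:
            return "low"
        if f0 < high_cut:
            return "mid"
        return "high"

    bands = [band(v) for v in f0_values]

    # Share of frames falling into each range.
    counts = Counter(bands)
    range_dist = {name: counts[name] / len(bands) for name in ("low", "mid", "high")}

    # Share of each movement type among frame-to-frame range changes.
    moves = Counter((a, b) for a, b in zip(bands, bands[1:]) if a != b)
    total = sum(moves.values())
    move_dist = {pair: n / total for pair, n in moves.items()} if total else {}

    return range_dist, move_dist


if __name__ == "__main__":
    # Toy contour, invented for the example.
    ranges, movements = prosodic_fingerprint(
        [100, 110, 150, 160, 210, 220, 150, 100], low_cut=120, high_cut=200
    )
    print(ranges)
    print(movements)
```

The two returned dictionaries can be flattened into one feature vector per speaker, which is the kind of input a classifier or a correlation model over age, gender, and region would need.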
Questionnaires constitute a crucial tool in linguistic typology and language description. By nature, a Questionnaire is both an instrument and a result of typological work: its purpose is to help the study of a particular phenomenon cross-linguistically or in a particular language, but the creation of a Questionnaire is in its turn based on the analysis of cross-linguistic data.
We attempt to ease the linguist’s work by constructing lexical Questionnaires automatically prior to any manual analysis. A convenient Questionnaire format for revealing fine-grained semantic distinctions includes pairings of words with diagnostic contexts that trigger different lexicalizations across languages. Our method for constructing this type of Questionnaire relies on distributional vector representations of words and phrases, which serve as input to a clustering algorithm. As output, our system produces a compact prototype Questionnaire for cross-linguistic exploration of contextual equivalents of lexical items, with groups of three homogeneous contexts illustrating each usage. We provide examples of automatically generated Questionnaires based on 100 frequent adjectives of Russian, including veselyj ‘funny’, ploxoj ‘bad’, dobryj ‘kind’, bystryj ‘quick’, ogromnyj ‘huge’, krasnyj ‘red’, byvšij ‘former’, etc. Quantitative and qualitative evaluation of the Questionnaires confirms the viability of our method.
This paper surveys relative clause constructions in West Circassian (Adyghe) and Kabardian.
The paper provides linguistic explanations for the results of supervised machine learning experiments on the identification of verbal metaphor in Russian texts. We look at the classification accuracy of models based on different features (distributional semantics, lexical and morphosyntactic co-occurrence, etc.) and explore the behavior of verb constructions and wider context in order to investigate the reasons behind the most and the least successful performances.
The paper presents a methodology for the automatic construction of a lexical typological questionnaire based on data from the monolingual Russian National Corpus. Using the domains ‘sharp’, ‘straight’, ‘thick’, and ‘smooth’ as a test dataset, we elaborate an algorithm that constructs a list of collocations for the corresponding Russian adjectives, computes a vector representation for every collocation, clusters the vector space into semantically homogeneous groups and extracts three central elements from every cluster. We compare the resulting questionnaires with manually prepared ones and conclude that the suggested methodology yields high-quality results and can be used in lexical typological research.
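The last step of the pipeline in this abstract, extracting three central elements from every cluster, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that collocation vectors and cluster labels have already been computed by earlier steps, that "central" means closest to the cluster centroid by cosine similarity, and that the function names, collocations, and toy vectors below are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def central_elements(vectors, labels, k=3):
    """Pick the k collocations closest to each cluster centroid.

    vectors: dict mapping collocation -> list of floats
    labels:  dict mapping collocation -> cluster id (from any clustering step)
    """
    clusters = defaultdict(list)
    for item, label in labels.items():
        clusters[label].append(item)

    result = {}
    for label, items in clusters.items():
        dim = len(vectors[items[0]])
        centroid = [sum(vectors[i][d] for i in items) / len(items) for d in range(dim)]
        ranked = sorted(items, key=lambda i: cosine(vectors[i], centroid), reverse=True)
        result[label] = ranked[:k]
    return result


if __name__ == "__main__":
    # Toy 2-dimensional vectors, invented for the example.
    vectors = {
        "ostryj nož": [1.0, 0.0],
        "ostraja britva": [1.0, 0.1],
        "ostryj meč": [1.0, -0.1],
        "ostryj sous": [0.0, 1.0],
        "ostryj um": [-1.0, 0.2],
        "ostroe slovo": [-1.0, 0.0],
    }
    labels = {
        "ostryj nož": 0, "ostraja britva": 0, "ostryj meč": 0, "ostryj sous": 0,
        "ostryj um": 1, "ostroe slovo": 1,
    }
    print(central_elements(vectors, labels))
```

Clusters with fewer than three members simply return all of their members; whether such clusters should be kept in a questionnaire is a design decision the abstract does not address.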
Relying upon data from the Russian National Corpus, the paper studies Russian wh-exclamatives with and without predicates. Firstly, it makes a list of wh-exclamatives with each of the following eight wh-words: do čego, kak, kakoj, kakov, naskol’ko, skol’, skol’ko, čto za. Secondly, on the basis of the corpus frequencies of the established wh-exclamatives, it shows that wh-exclamatives that involve NPs predominantly occur without predicates, whereas those that do not involve NPs predominantly occur with predicates. Thirdly, the paper reveals that wh-exclamatives without predicates are mostly Nominative-marked and that their most frequent type, kakoj-exclamatives, involves either a scalar adjective or, if the NP lacks an adjective, a scalar noun. Last but not least, the paper demonstrates which wh-constructions function only as exclamatives, that is, which of them are E-only in terms of Portner and Zanuttini (2003).
The paper traces the level of bilingualism in several highland villages of Daghestan (Northeast Caucasus) through the 20th century. We show that historically, men were more multilingual than women, but this was not true to the same extent for all languages. Highlanders’ repertoires suggest a correlation between the social function of the second language and the degree to which its command was gendered. We also explore the dynamics of multilingualism from the generation born at the end of the 19th century to the generation born in the 1990s. We show that during the 20th century local L2s were gradually displaced by Russian, and Daghestanian multilingualism lost its gendered character. We argue that these changes were caused by the introduction of Soviet schooling.
Drawing upon recent insights into the role of Goal preference as a reflector of cross-linguistic differences, this paper investigates the factors affecting the realization of Goals in motion event descriptions. In particular, it examines the interplay between the lexicalization pattern of a language, on the one hand, and grammatical viewpoint aspect, on the other – factors which have commonly been treated in isolation. To this end, three typologically distinct languages are examined: English, German and Greek. The empirical basis of this paper includes: (a) a corpus study, in which we examined the distribution of Goals in a small set of verbs, and (b) an experimental verbalization study, in which we elicited descriptions of different motion event types. While the former does not give a clear picture of the cross-linguistic differences in Goal prominence, the latter indicates that lexicalization pattern plays a more prominent role than grammatical viewpoint aspect in affecting Goal realization.
Head/dependent marking is a typological parameter based on whether syntactic relations, or dependencies, are marked on the head of the relation, on the non-head, on both, on neither, or elsewhere in the constituent. The approach has figured in descriptive, comparative and theoretical work of various kinds for some thirty years and has proven quite useful as far as it goes, but advances in the analysis of phrase structure and descriptions of previously unnoticed patterns over those decades have revealed some gaps, imprecisions and inconsistencies in the original formulation. These can be removed by allowing markers to be assigned not to words but to entire phrases, a move that also allows detached and neutral marking to be more comfortably accommodated in locus theory.
This chapter focuses on imageries and historical change in the European Russian Arctic.
The paper focuses on two aspectual morphemes in Moksha Mordvin (< Mordvin < Finno-Ugric). The first of them, the Frequentative, has four phonologically conditioned allomorphs, -ənd-, -n’ə-, -s’ə-, and -kšn’ə-. These affixes used to be separate morphemes in Proto-Finno-Ugric, but ended up having the same meaning and being complementarily distributed. A remnant of a more archaic stage of language evolution is the Avertive marker, -əkšn’ə-, which differs from one of the Frequentative allomorphs by only one phoneme; this can hardly be a coincidence. A diachronic hypothesis about how iterative-avertive polyfunctionality could have arisen is suggested.
Lemmatisation, which is one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so they can be analysed as a single item. This task is often considered solved for most modern languages regardless of their morphological type, but the situation is dramatically different for ancient languages. The rich inflectional systems and high level of orthographic variation common to these languages, together with a lack of resources, make lemmatising historical data a challenging task. It is becoming more and more important as manuscripts are now being extensively digitized, but still remains poorly covered in the literature. In this work, I compare a rule-based and a neural-network-based approach to lemmatisation for Early Irish data.
This paper describes the distribution of colour adjectives in Russian poetry of the Silver Age and defines individual preferences with regard to poetic tradition, syllable structure, and metrical restrictions. The research method combines a lexico-semantic approach, formal literary analysis, and quantitative metrics obtained via the frequency database of the Russian Poetry Corpus (over 10 M words, incl. 1 M adjectives). The database allows the user to compare subcorpora and create graphs of timeline distribution, which demonstrate that the lexical diversity and relative frequencies of colour adjectives start to grow rapidly in the 1890s, as modernists employ colour adjectives to upgrade the poetic inventory. The adjectives referring to non-banal hues (e.g. fioletovyj ‘violet’, lazorevyj ‘azure’) belong to the middle part of the ranked wordlist. Correspondence analysis of the data reveals individual colour preferences and stylistic similarities among the most prominent poets of the Silver Age; for example, Anna Akhmatova and Alexander Blok are similar regarding their use of the white hues. The distribution of the selected colour hue adjectives across metrical types highlights the strong association of multi-syllabic adjectives with certain meters, although some words have a more complex distribution.
The article compares the qualities ‘sharp’ and ‘blunt’ in 20 languages. We show that they tend to be unequal, with bluntness being negatively defined through sharpness. The two main oppositions in the domain are 1) the type of sharp object, and 2) the sense through which the quality is primarily experienced. The first opposition divides objects into bladed (knives etc.) and pointed (needles etc.); the second deals with touch vs. vision and translates to function (sharp/blunt instruments etc.) vs. shape (pointed/rounded features etc.).
We also find that these oppositions determine the semantic shifts that a word of sharpness or bluntness can have, and that the metaphoric patterns are consistent across languages.
Northwest Caucasian languages display a high degree of polysynthesis (manifested in complex words which bear much information on arguments and the characteristics of a situation), both prefixes and suffixes, with some morphemes being capable of appearing both as prefixes and as suffixes, ergative-based cross-reference of core arguments and indirect objects introduced by applicatives, highly developed means of expressing locational semantics within the predicate, and intricate tense-modality-aspect systems. Although classical noun-to-verb incorporation does not occur, there are constructions akin to incorporation, especially in the nominal domain. Nouns constitute a subclass of a broad class of predicates (both morphologically and syntactically) and form word-like nominal complexes with their attributes. Morphemes demonstrate features which are not typical of morphemes in Standard Average European languages, including much autonomy reflected in affix order variation and the ability to attach to complex syntactic constituents.
Narrative competence is an essential part of language proficiency. Research on narrative competence has both theoretical and empirical value. Our study aims to assess the narrative competence of adult L2 Russian learners and to investigate the relationship between their narrative competence and their language proficiency. For assessment, we used the Multilingual Assessment Instrument for Narratives adapted for the Russian language. We also designed a scale for assessing microstructure in Russian narratives. The study uses both qualitative and quantitative analysis. The results show that the macrostructural narrative subcompetence of L2 Russian learners does not depend on their language proficiency (except for an ability to produce structurally shorter episodes at higher levels), and that their microstructural narrative subcompetence depends on their language proficiency only in some ways. Our study contributes to the theory of narrative competence in L2 acquisition.