Poetic texts pose a challenge to full morphological tagging and lemmatization, since authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, and use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic taggers based on decision trees, CRF and neural network algorithms, as well as a state-of-the-art dictionary-based tagger. The taggers were trained on prose texts and tested on three poetic samples of different complexity. First, we suggest a method for compiling gold standard datasets for Russian poetry. Second, we focus on the taggers’ performance in identifying part-of-speech tags and lemmas. We show which POS classes, paradigm classes and syntactic patterns most affect the quality of processing.
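The kind of comparison described in this abstract, tagger output scored against a gold standard for POS tags and lemmas with a breakdown by POS class, might be sketched as follows. This is a minimal illustration rather than the authors' evaluation script: the (form, POS, lemma) triple format, the helper name `evaluate_tagging`, the case-insensitive lemma comparison and the toy tokens are all assumptions made here.

```python
from collections import defaultdict


def evaluate_tagging(gold, predicted):
    """Compare predicted (form, pos, lemma) triples with gold ones.

    Hypothetical helper: assumes both lists are token-aligned.
    Returns overall POS accuracy, overall lemma accuracy and
    POS accuracy broken down by gold POS class.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    if not gold:
        raise ValueError("empty input")

    pos_hits = 0
    lemma_hits = 0
    per_class = defaultdict(lambda: [0, 0])  # gold POS -> [correct, total]

    for (_, g_pos, g_lemma), (_, p_pos, p_lemma) in zip(gold, predicted):
        per_class[g_pos][1] += 1
        if g_pos == p_pos:
            pos_hits += 1
            per_class[g_pos][0] += 1
        # Assumption: lemmas are compared case-insensitively.
        if g_lemma.lower() == p_lemma.lower():
            lemma_hits += 1

    n = len(gold)
    by_class = {pos: correct / total
                for pos, (correct, total) in per_class.items()}
    return pos_hits / n, lemma_hits / n, by_class


# Toy data, invented for illustration only (a noun/verb homograph).
gold = [("стекло", "NOUN", "стекло"), ("стекло", "VERB", "стечь"),
        ("на", "ADP", "на"), ("пол", "NOUN", "пол")]
pred = [("стекло", "NOUN", "стекло"), ("стекло", "NOUN", "стекло"),
        ("на", "ADP", "на"), ("пол", "NOUN", "пол")]

pos_acc, lemma_acc, by_class = evaluate_tagging(gold, pred)
print(pos_acc, lemma_acc, by_class)
```

The per-class dictionary is what makes it possible to see which POS classes pull the overall score down, which is the question the abstract raises.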
This paper discusses novel facts regarding adpositional agreement in Avar in light of recent theories of feature valuation. I show that the traditional notion of downward Agree/upward valuation is sufficient to account for the observed facts, rendering the competing mechanism of upward Agree/downward valuation superfluous.
The 2019 Shared Task on Automatic Gapping Resolution for Russian (AGRR2019) aims to tackle a non-trivial linguistic phenomenon, gapping, which occurs in coordinated structures and elides a repeated predicate, typically from the second clause. In this paper we define the task and evaluation metrics, provide detailed information on data preparation, annotation schemes and methodology, analyze the results and describe the different approaches taken by the participating solutions.
The paper examines the properties of heavy as a perceptual concept, based on evidence from 11 languages. We demonstrate that the semantics of this concept is heterogeneous; lexemes of this field can be used in situations of at least three types: Lifting, Shifting and Weighing. These situations are either lexicalised as separate words or they converge in a single lexeme in various combinations following certain strategies. We also argue that different metaphorical extensions correspond to different situation types; this allows us to use analysis of metaphoric shifts as an additional instrument to establish the semantic structure of direct meanings.
The paper discusses efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of morphological annotation (the RNC schema and the Universal Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking and conversion.
The paper reports a method for creating a speaker’s prosodic fingerprint based on the global characteristics of pitch movement. A prosodic fingerprint is the distribution of f0 across the low, middle, and high ranges together with the distribution of pitch movements from one range into another [Šimko et al. 2017]. This fully automated method can be used to classify recordings and to provide a reference level for more sophisticated analysis of pitch movement and intonation strategies. We evaluate the method by applying it to spontaneous spoken Russian data recorded in different regions. We model the correlation between the fingerprint and sociolinguistic features such as age, gender, and region. The results of this analysis allow us to formulate several sociolinguistic hypotheses that can be tested further with more detailed analytic techniques.
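A fingerprint of this kind, range occupancy plus movements between ranges, could be computed roughly as in the sketch below. It is not the procedure of Šimko et al.: the abstract does not say how the three ranges are delimited, so the code assumes equal thirds of the speaker's own f0 span, treats every frame-to-frame step as a movement, and uses an invented f0 track; the name `prosodic_fingerprint` is likewise hypothetical.

```python
from collections import Counter

RANGES = ("low", "mid", "high")


def prosodic_fingerprint(f0_values):
    """Sketch of a range-based pitch fingerprint.

    Assumption: the low/mid/high ranges are equal thirds of the
    speaker's own f0 span (min to max). Unvoiced frames must be
    removed beforehand.
    """
    if len(f0_values) < 2:
        raise ValueError("need at least two voiced f0 values")
    lo, hi = min(f0_values), max(f0_values)
    if lo == hi:
        raise ValueError("f0 is constant; ranges are undefined")
    width = (hi - lo) / 3

    def to_range(f0):
        # min() keeps the maximum value inside the 'high' range.
        return RANGES[min(int((f0 - lo) / width), 2)]

    labels = [to_range(v) for v in f0_values]

    # Share of frames spent in each range.
    counts = Counter(labels)
    occupancy = {r: counts[r] / len(labels) for r in RANGES}

    # Share of frame-to-frame steps between ranges (staying put included).
    moves = Counter(zip(labels, labels[1:]))
    total_moves = len(labels) - 1
    transitions = {(a, b): moves[(a, b)] / total_moves
                   for a in RANGES for b in RANGES}
    return occupancy, transitions


# Invented f0 track in Hz, for illustration only.
track = [100, 110, 150, 190, 200, 160, 120, 100]
occupancy, transitions = prosodic_fingerprint(track)
print(occupancy)
print(transitions[("low", "mid")])
```

The two dictionaries together form a fixed-length vector per speaker, which is what would allow recordings to be compared or correlated with age, gender and region.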
Questionnaires constitute a crucial tool in linguistic typology and language description. By nature, a Questionnaire is both an instrument and a result of typological work: its purpose is to help the study of a particular phenomenon cross-linguistically or in a particular language, but the creation of a Questionnaire is in its turn based on the analysis of cross-linguistic data.
We attempt to ease the linguist’s work by constructing lexical Questionnaires automatically, prior to any manual analysis. A convenient Questionnaire format for revealing fine-grained semantic distinctions pairs words with diagnostic contexts that trigger different lexicalizations across languages. Our method for constructing this type of Questionnaire relies on distributional vector representations of words and phrases, which serve as input to a clustering algorithm. As output, our system produces a compact prototype Questionnaire for cross-linguistic exploration of contextual equivalents of lexical items, with groups of three homogeneous contexts illustrating each usage. We provide examples of automatically generated Questionnaires based on 100 frequent adjectives of Russian, including veselyj ‘funny’, ploxoj ‘bad’, dobryj ‘kind’, bystryj ‘quick’, ogromnyj ‘huge’, krasnyj ‘red’, byvšij ‘former’, etc. Quantitative and qualitative evaluation of the Questionnaires confirms the viability of our method.
Udi is a Nakh-Daghestanian (Lezgic) language spoken in northern Azerbaijan, which has undergone many contact-induced changes due to the influence of unrelated languages of the eastern Caucasus (Indo-European, Turkic). A recent change is the borrowing of the conditional enclitic =sa from Azerbaijani (Turkic). In Udi, this marker can combine with finite indicative tenses, resulting in a series of derived ‘realis’ conditional mood forms. The clitic is also used to create an indefiniteness marker, which derives indefinite pronouns from interrogative ones. Prior to the borrowing of the Azerbaijani morpheme, Udi had no comparable marker to fulfil these functions, while other Lezgic languages employ their own native grammatical means for them (conditional clitics or auxiliaries). The acquisition of the borrowed clitic has thus made Udi more, not less, structurally isomorphic with the other languages of the Lezgic branch. This paper describes the functions related to the domain of conditional mood at various stages in the history of Udi and suggests a diachronic scenario for the borrowing of the Azerbaijani marker =sa.
This paper surveys relative clause constructions in West Circassian (Adyghe) and Kabardian.
The paper provides linguistic explanations for the results of supervised machine learning experiments on the identification of verbal metaphor in Russian texts. We look at the classification accuracy of models based on different features (distributional semantics, lexical and morphosyntactic co-occurrence, etc.) and explore the behavior of verb constructions and the wider context in order to investigate the reasons behind the most and the least successful performances.
The paper presents a methodology for the automatic construction of a lexical typological questionnaire based on data from a monolingual corpus, the Russian National Corpus. Using the domains ‘sharp’, ‘straight’, ‘thick’, and ‘smooth’ as a test dataset, we elaborate an algorithm that constructs a list of collocations for the corresponding Russian adjectives, computes a vector representation for every collocation, clusters the vector space into semantically homogeneous groups and extracts three central elements from every cluster. We compare the resulting questionnaires with manually prepared ones and conclude that the suggested methodology yields high-quality results and can be applied in lexical typological research.
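The last step of the algorithm described above, extracting three central elements from every cluster, might look like the following sketch. It assumes that collocation vectors and cluster assignments already exist (from whatever embedding and clustering steps are used) and that "central" means highest cosine similarity to the cluster mean; the abstract specifies neither, and the function names, two-dimensional vectors and English glosses standing in for Russian collocations are invented for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def central_elements(vectors, labels, k=3):
    """Return the k collocations closest to each cluster centroid.

    vectors: dict mapping collocation -> list of floats
    labels:  dict mapping collocation -> cluster id
    Assumption: centrality is cosine similarity to the cluster mean.
    """
    clusters = defaultdict(list)
    for item, cluster_id in labels.items():
        clusters[cluster_id].append(item)

    result = {}
    for cluster_id, items in clusters.items():
        dim = len(vectors[items[0]])
        centroid = [sum(vectors[i][d] for i in items) / len(items)
                    for d in range(dim)]
        ranked = sorted(items,
                        key=lambda i: cosine(vectors[i], centroid),
                        reverse=True)
        result[cluster_id] = ranked[:k]
    return result


# Invented toy vectors for the 'sharp' domain, illustration only.
vectors = {
    "sharp knife": [1.0, 0.0],
    "sharp razor": [0.9, 0.1],
    "sharp sword": [0.8, 0.2],
    "sharp axe": [0.5, 0.5],
    "sharp pain": [0.1, 1.0],
    "sharp feeling": [0.0, 0.9],
}
labels = {
    "sharp knife": 0, "sharp razor": 0, "sharp sword": 0, "sharp axe": 0,
    "sharp pain": 1, "sharp feeling": 1,
}

result = central_elements(vectors, labels)
print(result)
```

In the toy run the outlying "sharp axe" is dropped from the instrument cluster, while the two-member cluster is returned whole because it has fewer than three elements.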
The goals of research on conceptual metaphor in discourse are at present remarkably multifaceted, from describing specific social, pragmatic, rhetorical, aesthetic, and discursive functions in real discourse data, through assessing metaphor entrenchment in the cultural and conceptual system, to identification methods as well as criteria for metaphorical mapping description and classification. The volume the reader is about to explore provides a broad panorama of perspectives tackling diverse aspects of metaphor analysis, including a wide range of topics such as the levels of source domain knowledge configuration, new target domain knowledge, conscious usage, metaphor identification procedures, communicative functions, linguistic metaphor, visual modes of metaphorical expression, corpus processing, trans-modal metaphor, among others. One of the assets of this collective work consists in showing how the scrutiny of metaphorical connections in multimodal discourse reveals the conceptual nature of metaphorical thinking. The book is organized in three parts, each one focussing on certain aspects of metaphor analysis in discourse. The first part emphasizes the description and characterization of metaphorical knowledge. The chapters offer a view on knowledge configurations like image schemas, frames, scenarios and domains that configure particular kinds of discourse and knowledge. The second part puts the stress on communicative aspects, particularly on the analysis of author/speaker intentionality and the tools to measure intention and effect in metaphor usage. Finally, the third block in the volume delves into the intricacies of disclosing metaphorical codes in non-linguistic modes of semiosis, be it cartoons, film, or other visual media.
Relying on data from the Russian National Corpus, the paper studies Russian wh-exclamatives with and without predicates. First, it compiles a list of wh-exclamatives with each of the following eight wh-words: do čego, kak, kakoj, kakov, naskol’ko, skol’, skol’ko, čto za. Second, on the basis of the corpus frequencies of the established wh-exclamatives, it shows that wh-exclamatives that involve NPs predominantly occur without predicates, whereas those that do not involve NPs predominantly occur with predicates. Third, the paper reveals that predicate-less wh-exclamatives are mostly Nominative-marked and that their most frequent type, kakoj-exclamatives, involves either a scalar adjective or, if the NP lacks an adjective, a scalar noun. Finally, the paper demonstrates which wh-constructions function only as exclamatives, that is, which of them are E-only in terms of Portner and Zanuttini (2003).
The paper traces the level of bilingualism in several highland villages of Daghestan (Northeast Caucasus) through the 20th century. We show that historically, men were more multilingual than women, but this was not true to the same extent for all languages. Highlanders’ repertoires suggest a correlation between the social function of the second language and the degree to which its command was gendered. We also explore the dynamics of multilingualism from the generation born at the end of the 19th century to the generation born in the 1990s. We show that during the 20th century local L2s were gradually displaced by Russian, and Daghestanian multilingualism lost its gendered character. We argue that these changes were caused by the introduction of Soviet schooling.
This article addresses the problem of defining genre in computational linguistics and the search for parameters that could formalize the concept. Existing genre typologies rely on different types of features, whereas in NLP practice modern applications are adapted to learning on big data and therefore rely on text features that require no additional manual markup. Using such text-internal features, we focus on differentiating various genres and grouping them by similar feature distributions. We describe and interpret the contribution of the various feature types to the final result and analyze how such features can be used for further adaptation of NLP models. The materials of the "Taiga" corpus with genre annotation are used as experimental data.
Drawing upon recent insights into the role of Goal preference as a reflector of cross-linguistic differences, this paper investigates the factors affecting the realization of Goals in motion event descriptions. In particular, it examines the interplay between the lexicalization pattern of a language, on the one hand, and grammatical viewpoint aspect, on the other, two factors which have commonly been treated in isolation. To this end, three typologically distinct languages are examined: English, German and Greek. The empirical basis of this paper includes: (a) a corpus study, in which we examined the distribution of Goals with a small set of verbs, and (b) an experimental verbalization study, in which we elicited descriptions of different motion event types. While the former does not give a clear picture concerning the cross-linguistic differences in Goal prominence, the latter indicates that lexicalization pattern assumes a more prominent role than grammatical viewpoint aspect in affecting Goal realization.
Head/dependent marking is a typological parameter based on whether syntactic relations, or dependencies, are marked on the head of the relation, on the non-head, on both, on neither, or elsewhere in the constituent. The approach has figured in descriptive, comparative and theoretical work of various kinds for some thirty years and has proven quite useful as far as it goes, but advances in the analysis of phrase structure and descriptions of previously unnoticed patterns over that time have revealed some gaps, imprecisions and inconsistencies in the original formulation. These can be removed by allowing markers to be assigned not to words but to entire phrases, a move that also allows detached and neutral marking to be more comfortably accommodated in locus theory.
This chapter focuses on imageries and historical change in the European Russian Arctic.
The paper focuses on two aspectual morphemes in Moksha Mordvin (< Mordvin < Finno-Ugric). The first of them, the Frequentative, has four phonologically conditioned allomorphs, -ənd-, -n’ə-, -s’ə-, and -kšn’ə-. These affixes used to be separate morphemes in Proto-Finno-Ugric, but ended up having the same meaning and being in complementary distribution. A remnant of a more archaic stage of language evolution is the Avertive marker, -əkšn’ə-, which differs from one of the Frequentative allomorphs by only one phoneme; this can hardly be a coincidence. A diachronic hypothesis about how iterative-avertive polyfunctionality could have arisen is suggested.
Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item. This task is often considered solved for most modern languages regardless of their morphological type, but the situation is dramatically different for ancient languages. The rich inflectional systems and high level of orthographic variation common to these languages, together with a lack of resources, make lemmatising historical data a challenging task. It is becoming increasingly important as manuscripts are extensively digitized, yet it remains poorly covered in the literature. In this work, I compare a rule-based and a neural network-based approach to lemmatisation of Early Irish data.