CLLS 2018: Papers with Abstracts

Abstract. In this paper we present an application of a specific neural network model to sentiment attitude extraction that requires no handcrafted NLP features. Given a mass-media article and the list of named entities mentioned in it, the task is to extract sentiment relations between these entities. We treat this problem at the level of whole documents as a three-class machine learning task. We use a modified Convolutional Neural Network architecture called the Piecewise Convolutional Neural Network (PCNN). It exploits the positions of named entities in the text to emphasize aspects of the inner and outer contexts of the relation between entities. The experiments use the RuSentRel corpus, which contains Russian analytical texts in the domain of international relations.
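The piecewise idea can be illustrated with a minimal sketch. It assumes one common formulation of piecewise pooling, in which the outputs of a single convolution filter are split into three segments by the two entity positions and each segment is max-pooled separately. The function name `piecewise_max_pool`, the segment boundaries, and the 0.0 fallback for an empty segment are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def piecewise_max_pool(features: Sequence[float], e1: int, e2: int) -> List[float]:
    """Max-pool one filter's outputs over three segments split at two entity positions.

    `features` holds one value per token position; `e1` and `e2` are the
    (hypothetical) token indices of the two named entities, with e1 < e2.
    """
    if not 0 <= e1 < e2 < len(features):
        raise ValueError("expected 0 <= e1 < e2 < len(features)")
    segments = [
        features[: e1 + 1],         # outer context, up to and including the first entity
        features[e1 + 1 : e2 + 1],  # inner context, up to and including the second entity
        features[e2 + 1 :],         # outer context after the second entity
    ]
    # The last segment is empty when the second entity is the final token;
    # it is pooled to 0.0 here (an assumption made for this sketch).
    return [max(seg) if seg else 0.0 for seg in segments]


pooled = piecewise_max_pool([0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.7], e1=1, e2=4)
print(pooled)  # [0.9, 0.8, 0.7]
```

Compared with a single global max over the sentence, the three pooled values keep apart the evidence found before, between, and after the two entities.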
Abstract. Automatic morphological analysis is one of the fundamental tasks of Natural Language Processing (NLP). Because Internet texts include both normative texts (news, fiction, nonfiction) and less formal texts (such as blogs and social network posts), their morphological tagging has become a non-trivial and topical task. In this paper we describe our experiments in tagging Internet texts and present our approach based on deep learning. We created a new social media test set that allows us to compare our system with state-of-the-art open source analyzers on social media texts.
Abstract. We explore a recently proposed question answering system. We developed a high-speed modification that divides the question answering system into three consecutive stages. The first step is to find the candidate documents that most likely contain the answer to the question. The second step is to rank sentences by the probability that they contain a correct answer to the question. The third step is to find the exact phrase that answers the question. At the third step we use a recently proposed bidirectional recurrent neural network that predicts the beginning and the end of the answer. We show that the proposed modification speeds up the question answering system without significant loss in quality. For each step we also explore feature space construction techniques that improve the final quality.
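The three consecutive stages can be sketched as a small pipeline. This is a toy illustration only: every function name is hypothetical, word overlap stands in for the retrieval and sentence-ranking models, and the span extractor is a crude stand-in for the bidirectional recurrent network described in the abstract.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap(question: str, text: str) -> int:
    """Number of distinct words shared by the question and the text."""
    return len(set(tokens(question)) & set(tokens(text)))


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Stage 1: keep the k documents most likely to contain the answer."""
    return sorted(documents, key=lambda d: overlap(question, d), reverse=True)[:k]


def rank_sentences(question: str, document: str, k: int = 1) -> List[str]:
    """Stage 2: rank the sentences of a document and keep the top k."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return sorted(sentences, key=lambda s: overlap(question, s), reverse=True)[:k]


def extract_span(question: str, sentence: str) -> str:
    """Stage 3 stand-in: return the sentence words that do not occur in the question.

    A real system would predict the start and end of the answer span instead.
    """
    asked = set(tokens(question))
    return " ".join(w for w in re.findall(r"\w+", sentence) if w.lower() not in asked)


def answer(question: str, documents: List[str]) -> str:
    """Run the three stages in sequence on the best document and sentence."""
    best_doc = retrieve(question, documents)[0]
    best_sentence = rank_sentences(question, best_doc)[0]
    return extract_span(question, best_sentence)


docs = [
    "Paris is the capital of France. It lies on the Seine.",
    "Berlin is the capital of Germany. It lies on the Spree.",
]
print(answer("What is the capital of France?", docs))  # Paris
```

Because each stage narrows the input to the next, the most expensive model only has to read a few sentences, which is the kind of speed-up the staged design aims for.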
Abstract. Coreference resolution is recognized as an important task in natural language processing, and it has been shown that knowledge of the semantic relations between two possibly coreferent entities can improve the quality of automated solutions. One way to integrate semantic information into such a system is to measure the semantic relatedness between candidates for a coreference relation. This research evaluates the efficiency of different types of semantic relatedness metrics, calculated from different sources, for coreference resolution on Russian-language data.
Abstract. The Japanese language has a great variety of verb inflectional suffixes (auxiliaries), each with a conjugation of its own. In this paper we propose a corpus-based approach to studying Japanese verb paradigms. Such an approach benefits from identifying possible verb forms in large amounts of written language data. We describe the methods and tools used for building databases of verbs and auxiliaries and for parsing verb 7-grams from a Japanese N-gram Corpus.
Abstract. Information extraction and sentiment analysis of social network content is a relatively new field, but a very promising and quickly developing one. In the current work, we discuss the possibilities of automatic detection and analysis of ironic and sarcastic messages in Twitter, one of the social networks most popular among both users and researchers. A distinctive feature of this research is that it analyzes content in languages that are rarely studied in this context. We focus on Spanish, German and Russian, applying the same analysis algorithms to all three languages and identifying the differences.
Abstract. The paper describes two hybrid neural network models for named entity recognition (NER) in texts, as well as the results of experiments with them. The first model, Bi-LSTM-CRF, is known and widely used for NER, while the second model, Gated-CNN-CRF, is proposed in this work. It combines a convolutional neural network (CNN), gated linear units, and conditional random fields (CRF). Both models were tested for NER on datasets in three languages: English, Russian, and Chinese. All resulting precision, recall and F1-measure scores for both models are close to the state of the art for NER, and on the English CoNLL-2003 dataset the Gated-CNN-CRF model achieves an F1-measure of 92.66, outperforming the known result.
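As a brief illustration of the gating component, a gated linear unit multiplies one linear (or convolutional) output elementwise by the sigmoid of another. The sketch below assumes the two input vectors `values` and `gates` have already been computed by separate convolutions; the function names are hypothetical and the code is not the authors' model.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_linear_unit(values: Sequence[float], gates: Sequence[float]) -> List[float]:
    """Elementwise values * sigmoid(gates): the gate decides how much of each value passes."""
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


print(gated_linear_unit([1.0, 2.0], [0.0, 0.0]))  # [0.5, 1.0]
```

A gate input of 0 lets exactly half of the value through; strongly positive gate inputs pass the value almost unchanged, and strongly negative ones suppress it.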
Abstract. In this paper, we address the problem of automatic extraction of discourse formulae. By discourse formulae (DF) we mean a special type of construction at the discourse level that has a fixed form and serves as a typical response in dialogue. Unlike traditional constructions [4, 5, 6], they do not contain variables within the sequence; their slots can be found in the left-hand or right-hand statements of the speech act. We have developed a system that extracts DF from drama texts. We compared token-based and clause-based approaches and found that the latter performs better. The clause-based model uses a uniform-weight vote of four classifiers and currently shows a precision of 0.30 and a recall of 0.73 (F1-score 0.42). The module was used to extract a list of DF from 420 drama texts of the 19th to 21st centuries [1, 7]. The final list contains 3000 DF, 1800 of which are unique. Further development of the project includes enhancing the module by extracting left-context features and applying other models, as well as exploring what the DF concept looks like in other languages.
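A uniform-weight vote can be sketched as follows, assuming each of the classifiers outputs a binary decision (1 for "this clause is a DF", 0 otherwise). The function name `uniform_vote` and the tie-breaking threshold of 0.5 are assumptions made for illustration; the abstract does not state how ties are resolved.

```python
from typing import Sequence


def uniform_vote(votes: Sequence[int], threshold: float = 0.5) -> int:
    """Combine binary classifier decisions with equal weights.

    Returns 1 when the share of positive votes reaches `threshold`
    (so a 2:2 tie counts as positive under the default, an assumption).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    share = sum(votes) / len(votes)
    return 1 if share >= threshold else 0


print(uniform_vote([1, 1, 0, 0]))  # 1
print(uniform_vote([1, 0, 0, 0]))  # 0
```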
Abstract. This article is devoted to the problem of defining genre in computational linguistics and to the search for parameters that could formalize the concept of genre. Existing genre typologies rely on different types of features, whereas in NLP practice modern applications are adapted to learning from big data and therefore to text features that require no additional manual markup. Based on such text-internal features, we focus on differentiating various genres and grouping them by similar feature distributions. We describe and interpret the contribution of the various feature types to the final result, and analyze how such features can be used for further adaptation of NLP models. The materials of the "Taiga" corpus with genre annotation are used as experimental data.
Abstract. In this paper we address the semantics of temporal expressions in natural language (such as vchera ’yesterday’, shestnadcatogo maja ’on the 16th of May’, tri dnja ’three days’) and the way they interact with other manifestations of temporality (such as the functioning of prepositions and aspectual verb forms). A formal, constituent-based description of heterogeneous temporal expressions is proposed. We consider the interval algebra presented by James Allen to be the right basis for such a description. The new formal system is compared with the well-known TimeML project. The latter has weak spots: the meaning of some temporal expressions simply cannot be represented in terms of TimeML. We discuss such cases and show how to analyze them in our formal system.
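For readers unfamiliar with Allen's interval algebra, the sketch below classifies a pair of intervals into one of its thirteen basic relations. Intervals are modeled as `(start, end)` pairs with `start < end`; the function name and the mapping of calendar expressions to day numbers in the usage lines are illustrative assumptions and are not the formal system proposed in the paper.

```python
from typing import Tuple

Interval = Tuple[float, float]


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the basic Allen relation that holds between intervals a and b."""
    (a1, a2), (b1, b2) = a, b
    if not (a1 < a2 and b1 < b2):
        raise ValueError("intervals must satisfy start < end")
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Only the two partial-overlap cases remain at this point.
    return "overlaps" if a1 < b1 else "overlapped-by"


# Hypothetical day-based encoding: May as days 0..31, the 16th of May as 15..16.
print(allen_relation((15, 16), (0, 31)))  # during
# "yesterday" as day 0..1 and "today" as day 1..2.
print(allen_relation((0, 1), (1, 2)))     # meets
```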
Abstract. Scientific texts contain many special terms which, together with their definitions, form an important part of the scientific knowledge to be extracted for applications such as text summarization and the construction of glossaries and ontologies. The paper reports rule-based methods for extracting terminological information, covering both recognition of term definitions and detection of term occurrences in scientific or technical texts. In contrast to corpus-based terminology extraction, the developed methods process a single text and rely on lexico-syntactic patterns and rules that encode specific linguistic information about terms in scientific texts. We briefly characterize LSPL, the formal language used to specify the patterns and rules, which is supported by programming tools and used for information extraction. Two applications of the methods are discussed: building a glossary for a given text document and constructing a subject index. For these applications, we describe both the collections of LSPL patterns and the extraction strategies, and give the results of their experimental evaluation.
Abstract. In our research we studied the dependency structure of texts in several genres (love stories, detective stories, science fiction and fantasy). We identified novel characteristics (syntactic attributes such as verb constructions and construction of a specific cumulative threshold) that can serve as additional machine learning features. We conducted an experiment with these novel features and showed that they can be useful for recognizing closely related genres.
Abstract. Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item, identified by the word’s lemma, or dictionary form. It is not a very complicated task for languages such as English, where a paradigm consists of a few forms close in spelling; but for morphologically rich languages, such as Russian, Hungarian or Irish, lemmatisation becomes more challenging. Nevertheless, this task is often considered solved for most resource-rich modern languages regardless of their morphological type. The situation is dramatically different for ancient languages, which are characterised not only by a rich inflectional system but also by a high level of orthographic variation and, more importantly, a very small amount of available data. These factors make automatic morphological analysis of historical language data an underrepresented field in comparison to other NLP tasks. This work describes the creation of an Early Irish lemmatiser with a character-level sequence-to-sequence learning method that proves effective in overcoming data scarcity. A simple character-level sequence-to-sequence model trained for 34,000 iterations reached an accuracy of 99.2% for known words and 64.9% for unknown words on a rather small corpus of 83,155 samples. It outperforms both the baseline and the rule-based model described in [21] and [76] and matches the results of other systems working with historical data.