LABORATORY 15

Laboratory of Computational Linguistics

Head of Laboratory – Dr.Sc. (Linguistics) Igor Boguslavsky

Tel.: (095) 299-49-27; E-mail: bogus@iitp.ru

 

The leading researchers of the laboratory include:

Full member of the Russian Academy of Sciences, Dr.Sc. (Linguistics) Jury D. Apresjan

Dr.Sc. (Linguistics) Vladimir Z. Sannikov

Nikolay V. Grigoriev

Dr. Leonid L. Iomdin

Alexander V. Lazursky

Dr. Leonid G. Mitjushin

Irina Sagalova

Dr. Leonid L. Tsinman

Victor G. Sizov

Dr. Svetlana A. Grigorieva

 

RESEARCH ACTIVITIES

The main research area of the Laboratory is the study of how natural language functions as a means of information transmission.

Basic research activities pursued in the laboratory are oriented towards the development of a fully operational formal model of language of the Meaning ⇔ Text class. This model simulates human linguistic behavior, including the basic ability of man to produce and comprehend natural language texts.

PRINCIPAL RESULTS

All scientific results obtained in 2000 extend the functional capabilities of the multipurpose NLP system ETAP-3. A demo version of the system is available over the Internet at http://proling.iitp.ru.

  1. New versions of the combinatorial dictionaries of Russian and English have been developed. Each of the two dictionaries now contains over 50,000 lexical entries, which is comparable in size to large traditional bilingual general-purpose dictionaries. The dictionaries have been not only enlarged but also thoroughly revised in accordance with recent achievements of theoretical lexicography. As a result, many aggregated polysemous entries have been systematically split into separate word meanings, which yields a more adequate presentation of the material and improves machine translation quality.
  2. Basic changes have been made in the government patterns (subcategorization frames) of predicate words in both the Russian and the English combinatorial dictionaries. The alterations are based on an entirely new semantic and syntactic classification of predicates, which takes into account the identity of government patterns and the similarity of meanings of the lexemes. The new classification was necessary because experiments with a variety of NLP systems and a computer-assisted language learning system had revealed a number of inconsistencies and contradictions in the semantic and syntactic treatment of the government properties of predicate words. These flaws arose because the corresponding zones of lexical entries in the combinatorial dictionaries had been elaborated without taking due account of the whole semantic and syntactic class to which individual words belonged.
  3. The Russian morphological dictionary has been considerably expanded. It now contains 120,000 lexical units (probably a record for computerized dictionaries of Russian that contain full-scale morphological data), including a large number of proper names, idiomatic composite words, syntactic derivatives, and neologisms.
  4. A new morphological analyzer has been designed. Morphological analysis, which determines the grammatical features of words in the text, is an integral part of any text processing procedure. The new analyzer, based on modern finite-state technology, is characterized by

  5. The parsing module of the ETAP-3 system has been improved. A weighting mechanism has been introduced into the module to assess the relative probability of hypothetical syntactic links and homonyms. The mechanism helps distinguish the core syntactic phenomena and lexical units of a language from the peripheral ones. It significantly increases the accuracy of selecting the correct syntactic structure from a set of possible options. In addition, it improves the performance of the parsing algorithm itself, because it automatically steers the algorithm towards producing the most probable structure first. In the future, the weights module is expected to work with the results of statistical pre-processing of text corpora, which should further improve parsing.
  6. A new version of the repertory of lexical functions has been designed. The canonical theory of lexical functions proceeded from the assumption that semi-auxiliary verbs of the OPER-FUNC family, which serve as a basis for paraphrasing natural language utterances, are bivalent. Accordingly, the paraphrasing rules determined the arrangement of two arguments of the source predicate with respect to the semi-auxiliary verb. However, natural languages have a large number of semi-auxiliary verbs that inherit several arguments of the keyword in addition to their own arguments. For this reason, the developers of the ETAP-3 system had to resort to individual ad hoc rules to arrange all arguments with respect to such polyvalent semi-auxiliaries. A generalized theory of lexical functions was developed in 2000 to cover the cases of polyvalent semi-auxiliaries adequately.
  7. A generalized dictionary interface has been developed, which makes it possible to completely separate the logic of MT system operation from the methods, techniques and media of linguistic data storage (database server, indexed files, RAM, etc.). The interface drastically simplifies the maintenance and configuration of the software complex and practically eliminates the need to maintain parallel versions for systems that use different methods of linguistic data storage.
  8. An algorithm for resolving the grammatical and functional ambiguity of Russian words using morphological data and linear context has been developed. The algorithm may be used in a variety of applications, including those aimed at sophisticated text search, e.g. in Internet search engines.
  9. A new version of the deconversion module for UNL has been developed. The module was designed within the scope of an international project, “Universal Networking Language (UNL)”, aimed at overcoming language barriers on the Internet by giving users from different countries an opportunity to communicate in their own language. To meet this goal, the participants of the project have designed a universal electronic interlingua, UNL. UNL may be used to represent any meaning that can be conveyed in a natural language. For every natural language two reciprocal procedures must be developed: the conversion procedure, which translates a text written in this language into a UNL text, and the deconversion procedure, which translates a UNL expression into a text in the given natural language. The new version of the UNL-Russian deconverter takes into consideration the position of every text element in the knowledge base. A website has been opened on which UNL structures can be deconverted into Russian sentences (http://www.unl.ru).
  10. A series of experiments has been carried out to integrate the ETAP system with an online question answering system, IAW. The main goal of the experiments was to improve the performance of IAW by applying the ETAP parser to the queries. In the integrated system the workload is distributed among the modules as follows: the morphological and syntactic analysis of the query (in English) and the generation of its Prolog representation are performed by ETAP, while IAW uses the text plan produced by ETAP to generate an answer for the user. A comparison of IAW performance before and after the integration with ETAP has shown that taking into account the syntactic structure produced by ETAP increases IAW efficiency by 7%.
  11. A tagged corpus of Russian texts has been created, in which every word is supplied with morphological features and every sentence with a dependency tree structure. The corpus contains about 10,000 sentences.
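
The generalized dictionary interface mentioned in the list above separates the processing logic from the storage medium. The Python sketch below illustrates that general idea only; it is not the ETAP-3 implementation. All names in it (`DictionaryStore`, `MemoryStore`, `SqliteStore`, `translate_word`) and the toy entries are assumptions made for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DictionaryStore(ABC):
    """Hypothetical storage-agnostic interface to a dictionary."""

    @abstractmethod
    def lookup(self, lemma: str) -> Optional[str]:
        """Return the stored entry for a lemma, or None if it is absent."""


class MemoryStore(DictionaryStore):
    """Backend that keeps all entries in RAM."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self._entries = dict(entries)

    def lookup(self, lemma: str) -> Optional[str]:
        return self._entries.get(lemma)


class SqliteStore(DictionaryStore):
    """Backend that keeps entries in a SQLite database."""

    def __init__(self, entries: Dict[str, str], path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entries (lemma TEXT PRIMARY KEY, entry TEXT)"
        )
        self._conn.executemany(
            "INSERT OR REPLACE INTO entries VALUES (?, ?)", list(entries.items())
        )
        self._conn.commit()

    def lookup(self, lemma: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT entry FROM entries WHERE lemma = ?", (lemma,)
        ).fetchone()
        return row[0] if row else None


def translate_word(store: DictionaryStore, lemma: str) -> str:
    """Toy processing logic: it depends only on the interface, not the backend."""
    entry = store.lookup(lemma)
    return entry if entry is not None else f"<{lemma}?>"


if __name__ == "__main__":
    data = {"dom": "house", "kniga": "book"}  # toy data, not from the real dictionaries
    for store in (MemoryStore(data), SqliteStore(data)):
        print(type(store).__name__, translate_word(store, "kniga"))
```

Because `translate_word` sees only the abstract interface, swapping the storage medium requires no change to the processing code, which is the maintenance benefit the report describes.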

GRANTS From:

PUBLICATIONS IN 2000

  1. Apresjan, Ju. Systematic Lexicography. Oxford University Press, 2000, XVIII, 304 pp.
  2. Apresjan Ju., I. M. Boguslavsky, L. L. Iomdin, L. L. Tsinman. Lexical Functions in NLP: Possible Uses. In: Computational Linguistics for the New Millennium: Divergence or Synergy. Heidelberg, 2000, p. 1-11.
  3. Boguslavsky I. UNL from the linguistic point of view (in print)
  4. Boguslavsky I. Even in discourse: Interaction of lexical meanings and interpretation strategies (in print).
  5. Boguslavsky I., S. Grigorieva, N. Grigoriev, L. Kreidlin, N. Frid. Dependency Treebank for Russian: Concept, Tools, Types of Information // Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), 2000, p. 987-991.
  6. Boguslavsky I., N. Frid, L. Iomdin, L. Kreidlin, I. Sagalova, V. Sizov. Creating a Universal Networking Language Module within an Advanced NLP System // Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), 2000, p. 83-89.
  7. Carl M., L. Iomdin, C. Pease, O. Streiter. Towards a Dynamic Linkage of Example-Based and Rule-Based Machine Translation // MT (In print).
  8. Streiter O., L. Iomdin, I. Sagalova. Learning Lessons from Bilingual Corpora: Benefits for Machine Translation // International Journal of Corpus Linguistics. Vol. 5 (2), 2000, p. 1-32.
  9. Streiter O., M. Carl, L. Iomdin. A Virtual Translation Machine for Hybrid Machine Translation // Труды Международного семинара Диалог’2000 по компьютерной лингвистике и ее приложениям. Том 2. Протвино, 2000. С. 382-393.
  10. Апресян Ю. Д., В. В. Ботякова, Т. Э. Латышева и др. Англо-русский синонимический словарь. М.: Русский язык, 2000, изд. 5-е, стереотипное, 543 c. (переиздание).
  11. Апресян Ю. Д., Л. Л. Иомдин, Э. М. Медникова, А. В. Петрова и др. Новый большой англо-русский словарь. М.: Русский Язык, 2000. Изд. 5-е, стереотипное. T. I, 832 c., T. II, 828 c., T. III, 824 c. (переиздание).
  12. Апресян Ю. Д., О. Ю. Богуславская, Т. В. Крылова, И. Б. Левонтина, Е. В. Урысон и др. Новый объяснительный словарь синонимов русского языка. Второй выпуск. М.: Языки русской культуры, 2000, 487 с.
  13. Апресян Ю. Д. Предисловие к Новому объяснительному словарю синонимов русского языка. Изд. Второе, исправленное // Новый объяснительный словарь синонимов русского языка. Второй выпуск. М.: Языки русской культуры, 2000, V-VII.
  14. Апресян Ю. Д. Словарная статья словаря синонимов // Новый объяснительный словарь синонимов русского языка. Изд. Второе, исправленное // Новый объяснительный словарь синонимов русского языка. Второй выпуск. М.: Языки русской культуры, 2000, VIII-XVII.
  15. Апресян Ю. Д. Лингвистическая терминология словаря синонимов // Новый объяснительный словарь синонимов русского языка. Изд. Второе, исправленное // Новый объяснительный словарь синонимов русского языка. Второй выпуск. М.: Языки русской культуры, 2000, XVIII-XLV.
  16. Апресян Ю. Д. Многозначность и синонимия слова любить // Etnolingwistika. Problemy języka i kultury. 12. Lublin, 2000, c. 77-95.
  17. Апресян Ю. Д. Остановка движения как симптом внутреннего состояния: синонимический ряд замереть // Отцы и дети Московской лингвистической школы. Сборник статей в честь В.Н. Сидорова (в печати).
  18. Апресян Ю. Д. О лексических функциях семейства REAL – FACT // Сборник в честь Z. Saloni (в печати).
  19. Апресян Ю. Д. Глагол заставлять: семантический класс, синонимия, многозначность // Сборник в честь М. В. Панова (в печати).
  20. Апресян Ю. Д. От значения к несемантическим свойствам лексем: знание и мнение // Сборник докладов международного симпозиума в Экс-ан-Прованс в мае 2000 года (в печати).
  21. Апресян Ю. Д. Наказание в языковой картине мира // Сборник статей в честь 70-летия проф. А. Богуславского (в печати).
  22. Апресян Ю. Д. О системообразующих смыслах ‘знать’ и ‘считать’ в русском языке // Русский язык, № 1 (в печати).
  23. Апресян В. Ю., С. А. Григорьева. Волшебство в языке. Слово в тексте и в словаре // Сборник статей в честь 70-летия акад. Ю. Д. Апресяна. М.: Языки русской культуры, 2000.
  24. Богуславский И. М., Н. В. Григорьев, С. А. Григорьева, Л. Л. Иомдин, Л. Г. Крейдлин, В. З. Санников, Н. Е. Фрид. Аннотированный корпус русских текстов: концепция, инструменты разметки, типы информации // Труды Международного семинара Диалог’2000 по компьютерной лингвистике и ее приложениям. Том 2. Протвино, 2000. С. 41-47.
  25. Богуславский И. М., Л. Л. Иомдин, Л. Г. Крейдлин, Н. Е. Фрид, И. Л. Сагалова, В. Г. Сизов. Модуль универсального сетевого языка в составе системы ЭТАП-3 // Труды Международного семинара Диалог’2000 по компьютерной лингвистике и ее приложениям. Том 2. Протвино, 2000. С. 48-58.
  26. Богуславский И. М., Л. Л. Иомдин. Семантика медленности // Слово в тексте и в словаре. Сборник статей в честь 70-летия акад. Ю. Д. Апресяна. М.: Языки русской культуры, 2000. С. 52-60.
  27. Григорьева С. А. Нетривиальная семантическая сфера действия лексемы: случайность или закономерность? // Труды Международного семинара Диалог’2000 по компьютерной лингвистике и ее приложениям. Том 1. Протвино, 2000. С. 61.
  28. Иомдин Л. Л. Синтаксические особенности фразеологических единиц: новые подробности // Сборник статей в честь 70-летия проф. А. Богуславского (в печати).
  29. Санников В. З. О значении союза пускай/пусть // Отцы и дети Московской лингвистической школы. Сборник статей в честь В. Н. Сидорова (в печати).
  30. Цинман Л. Л., В. Г. Сизов. Лингвистический процессор ЭТАП: дескрипторное соответствие и обработка метафор // Труды Международного семинара Диалог’2000 по компьютерной лингвистике и ее приложениям. Том 2. Протвино, 2000. С. 366-369.
  31. Цинман Л. Л., В. Г. Сизов. Лингвистический процессор ЭТАП: процедура ослабления синтаксических правил и ее использование. // Слово в тексте и в словаре. Сборник статей в честь 70-летия акад. Ю. Д. Апресяна. М.: Языки русской культуры, 2000. С. 521-528.