Russian National Corpus (RNC): an Overview and Perspectives

E.A. Grishina, E.V. Rakhilina; Institute of Russian Language, Russian Academy of Sciences; VINITI, Russian Academy of Sciences

A national corpus is a collection of texts in electronic form that is:
• representative (for a given period)
• balanced (all text types and genres are represented proportionally)
• annotated (supplied with linguistically relevant information)

This paper presents the Russian National Corpus (RNC) project, supported by the Russian Academy of Sciences (ruscorpora.ru).

The RNC is intended to become an online information retrieval system based on Russian texts.
The main goals of the RNC are:
• assisting in the teaching of Russian;
• supporting all kinds of linguistic research, both descriptive and theoretical, synchronic and diachronic (especially lexicographic and morphosyntactic studies).

The Corpus is also a powerful tool for observing language changes, especially small and gradual ones.

We shall concentrate on the part of the project that concerns the so-called Main Corpus: an annotated corpus of Modern Russian of the second half of the 20th century (currently more than 50 million words, with up to 100 million words expected to be available by the end of 2005). It is representative and balanced, as it contains all types of written texts, including fiction (both prose and drama), memoirs, newspaper accounts and reviews, advertisements, texts on education, engineering, science, philosophy, religion, business, and law, as well as private texts not intended for publication (diaries, private correspondence, etc.). In addition, the Main Corpus has several types of annotation, including:
• morphological annotation (which is most important for Russian);
• meta-textual annotation (which allows the user to select a subcorpus for a specific search and is a distinctive feature of the RNC);
• accentual annotation (within the morphologically disambiguated subcorpus).
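
As an illustration of how meta-textual and morphological annotation can work together, the following is a minimal sketch in Python. It is not the RNC's actual data format or interface: the class names, the metadata keys (genre, year), the tag labels and the functions select_subcorpus and find_tokens are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str         # word form as it appears in the text
    lemma: str        # dictionary form
    grams: frozenset  # morphological tags (hypothetical tag names)


@dataclass
class Document:
    meta: dict        # meta-textual annotation, e.g. genre and year
    tokens: list = field(default_factory=list)


def select_subcorpus(corpus, **criteria):
    """Keep documents whose metadata matches every given criterion."""
    return [d for d in corpus
            if all(d.meta.get(k) == v for k, v in criteria.items())]


def find_tokens(docs, lemma=None, grams=()):
    """Return tokens matching the lemma (if given) and all listed tags."""
    wanted = set(grams)
    return [t for d in docs for t in d.tokens
            if (lemma is None or t.lemma == lemma) and wanted <= t.grams]


# Toy corpus; the texts and annotations are invented for illustration.
corpus = [
    Document({"genre": "fiction", "year": 1975},
             [Token("дома", "дом", frozenset({"S", "gen", "sg"}))]),
    Document({"genre": "diary", "year": 1992},
             [Token("домом", "дом", frozenset({"S", "ins", "sg"})),
              Token("шёл", "идти", frozenset({"V", "past", "sg"}))]),
]

# Step 1: restrict the search to a subcorpus chosen by metadata.
diaries = select_subcorpus(corpus, genre="diary")
# Step 2: run a morphological search inside that subcorpus only.
hits = find_tokens(diaries, lemma="дом", grams={"ins"})
```

In this sketch the subcorpus is chosen first and the morphological search runs only over the selected documents, which mirrors the two-step use of annotation described in the list above.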

We would like to discuss the further development of the RNC, which proceeds in two main directions: adding new texts and refining the system of annotation. In particular, we intend to integrate spoken texts into the Corpus, which will allow us to record the most recent trends in Modern Russian.

Another important improvement is an innovative system of semantic annotation. The annotation is based on a taxonomic classification of the Russian lexicon (“Lexicographer”) developed at VINITI in collaboration with Paducheva, Kustova and others. Examples of possible queries combining semantic and morphological information are discussed.
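
A query that combines semantic and morphological information might look roughly like the sketch below. The semantic class labels, the tag names and the query function are hypothetical and are not taken from the “Lexicographer” classification or from the RNC search interface; the sketch only shows the idea of intersecting the two kinds of annotation.

```python
# Hypothetical taxonomy: lemma -> semantic classes (labels invented here).
SEMANTIC_CLASSES = {
    "молоток": {"concrete", "instrument"},
    "радость": {"abstract", "emotion"},
    "нож": {"concrete", "instrument"},
}

# Each token: (word form, lemma, morphological tags); tag names are assumed.
TOKENS = [
    ("молотком", "молоток", {"S", "ins", "sg"}),
    ("радостью", "радость", {"S", "ins", "sg"}),
    ("нож", "нож", {"S", "nom", "sg"}),
]


def query(tokens, grams=(), sem=()):
    """Return word forms carrying all requested morphological and semantic tags."""
    grams, sem = set(grams), set(sem)
    return [form for form, lemma, tags in tokens
            if grams <= tags and sem <= SEMANTIC_CLASSES.get(lemma, set())]


# Nouns in the instrumental case that denote instruments.
instruments_in_ins = query(TOKENS, grams={"S", "ins"}, sem={"instrument"})
```

Here the semantic classes are attached to lemmas while the morphological tags are attached to individual word forms, so a single query can filter on both levels at once.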