A national corpus is a collection of texts in electronic form that is:
• representative (for a given period)
• balanced (all text types and genres represented proportionally)
• annotated (supplied with linguistically relevant information)
The paper presents the Russian National Corpus (RNC) project, supported by the Russian Academy of Sciences (ruscorpora.ru).
The RNC is intended to become an online information retrieval system based
on Russian texts.
The main goals of the RNC are:
• Assistance in teaching Russian;
• Supporting all kinds of linguistic research, both descriptive and theoretical,
synchronic and diachronic (especially lexicographic and morphosyntactic studies).
The Corpus is also a powerful tool for observing language changes, especially small and gradual ones.
We shall concentrate on the part of the project that concerns the so-called
Main Corpus: an annotated corpus of Modern Russian of the second half of the
20th century (currently more than 50 million words; up to 100 million words are
expected to be available by the end of 2005). It is representative and balanced,
as it contains all types of written texts, including fiction (both prose and
drama), memoirs, newspaper accounts and reviews, advertisements, texts on
education, engineering, science, philosophy, religion, business, and law, as well
as private texts not intended for publication (diaries, private correspondence,
etc.). In addition, the Main Corpus has several types of annotation, including:
• morphological annotation (which is most important for Russian);
• meta-textual annotation (which allows the user to select a subcorpus for a
specific search and is a distinctive feature of the RNC);
• accentual annotation (within the morphologically disambiguated subcorpus).
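As an illustration of how meta-textual annotation can be used to select a subcorpus, the following minimal Python sketch filters documents by their metadata before any search is run. The field names (genre, year) and the function select_subcorpus are hypothetical and are not taken from the RNC itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    genre: str  # hypothetical meta-textual field
    year: int   # hypothetical meta-textual field
    text: str


def select_subcorpus(docs, genre=None, year_from=None, year_to=None):
    """Return the documents whose metadata satisfy every given constraint."""
    selected = []
    for doc in docs:
        if genre is not None and doc.genre != genre:
            continue
        if year_from is not None and doc.year < year_from:
            continue
        if year_to is not None and doc.year > year_to:
            continue
        selected.append(doc)
    return selected


# Toy data, invented for this example only.
docs = [
    Document("A", "fiction", 1965, "sample text"),
    Document("B", "memoirs", 1978, "sample text"),
    Document("C", "fiction", 1992, "sample text"),
]

subcorpus = select_subcorpus(docs, genre="fiction", year_from=1950, year_to=1980)
print([d.title for d in subcorpus])
```

A search would then be run only over the selected documents, which is the sense in which the metadata define a subcorpus.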
We would like to discuss the further development of the RNC, which proceeds in two main directions: adding new texts and refining the system of annotation. In particular, we intend to integrate spoken texts into the Corpus, which will allow us to record the most recent trends in Modern Russian.
Another important improvement is an innovative system of semantic annotation. The annotation is based on a taxonomic classification of the Russian lexicon (“Lexicographer”) developed at VINITI in collaboration with Paducheva, Kustova and others. Examples of possible queries combining semantic and morphological information are discussed.
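A query combining semantic and morphological information can be sketched as follows. This is only an illustration under stated assumptions: the tag labels ("noun", "ins", "emotion", etc.), the Token structure and the search function are hypothetical and do not reproduce the RNC tag set or query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    gram: frozenset  # morphological tags (hypothetical labels)
    sem: frozenset   # semantic tags (hypothetical labels)


def matches(token, gram=(), sem=()):
    """True if the token carries all requested morphological and semantic tags."""
    return set(gram) <= token.gram and set(sem) <= token.sem


def search(tokens, gram=(), sem=()):
    """Return the tokens that satisfy both kinds of constraint."""
    return [t for t in tokens if matches(t, gram, sem)]


# Toy annotated tokens, invented for this example only.
tokens = [
    Token("столом", "стол",
          frozenset({"noun", "ins", "sg"}), frozenset({"concrete", "furniture"})),
    Token("радостью", "радость",
          frozenset({"noun", "ins", "sg"}), frozenset({"abstract", "emotion"})),
    Token("бежит", "бежать",
          frozenset({"verb", "pres", "3p", "sg"}), frozenset({"motion"})),
]

# Query: nouns in the instrumental case that denote an emotion.
hits = search(tokens, gram={"noun", "ins"}, sem={"emotion"})
print([t.lemma for t in hits])
```

The morphological constraint alone would return both nouns; adding the semantic constraint narrows the result to the emotion noun, which is the kind of combined query the semantic layer makes possible.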