
Linguistic resources of
RUSSICON Ltd


Russicon text corpora:
  1. RUSSICON Reference Corpus
    The corpus contains more than 150,000,000 word occurrences. It draws on a broad range of text types: Russian literature (including the Russian XX Century Literature Corpus, described below), criticism, philosophy, religion, newspapers, memoirs, law, business documents, computer documentation, historical documents, protocols, translations, folklore (songs, jokes, Internet/CD literature, etc.), "underground" literature, etc. The authors include about 500 well-known Russian writers of the XVII to XX centuries and approximately 200 famous Russian philosophers, theologians, critics, politicians and memoirists.
    The corpus files are prepared in text, HTML and SGML formats. Conversion to SGML was done with special conversion utilities and with the help of SoftQuad SGML Publishing Suite. Currently, only part of the corpus (approximately 5 million word occurrences) is linguistically encoded (tagged). The texts for this linguistically encoded corpus were selected so that every author is represented by text fragments totalling at least 4,000-5,000 words. Every word of the linguistically encoded corpus corresponds to an entry in one or several Russicon dictionaries.
    In the course of developing the reference corpus, the team has accumulated an extensive glossary of more than 500,000 lemmas, of which only about 200,000 have been processed so far, i.e. analysed and included in the dictionaries. Processing of the glossary continues.
  2. RUSSICON Russian XX Century Literature Corpus
    The corpus contains about 5,000,000 word occurrences. It consists of more than 10,000 texts by 400 eminent Russian writers (prose writers, poets and critics), among them:
    Adamovich G., Abramov F., Aksenov V., Andreev L., Annenskij I., Anninskij L., Antokol’skij P., Aleshkovskij Y., Akhmadulina B., Akhmatova A., Aldanov M., Amfiteatrov A., Averchenko A., Astaf’ev V., Aseev N., Ajtmatov C., Babel’ I., Bagritskij E., Bakhtin M., Bal’mont K., Bek A., Belov V., Belyj A., Berggol’ts O., Berberova N., Berkovskij N., Blok A., Bobrov S., Brodskij (Brodsky) I., Bryusov V., Bulgakov M., Bunin I., Burlyuk D., Bykov V., Bitov A., Chekhov A., Chukovskaya L., Chukovskij K., Chernyj Sasha, Daniel’ Yu., Dombrovskij Y., Dovlatov S., Efremov I., Ehrdman N., Ehrenburg I., Esenin S., Evtushenko E., Fadeev A., Fedin K., Forsh O., Galich A., Gazdanov G., Gajdar A., Gershenzon M., Gilyarovskij V., Ginzburg L., Gippius Z., Gorbanevskaya N., Gorenshtejn F., Gorkij M., Gorodetskij S., Granin D., Grebenshchikov B., Grigor’ev O., Grin A., Grossman V., Gumilev N., Il’f I., Iskander F., Ivanov G., Ivanov Vyach., Ivanov Vs., Kamenskij V., Kataev V., Kaverin V., Kazakevich E., Kazakov Y., Kharitonov Y., Kharms D., Khlebnikov V., Khodasevich V., Kim A., Klyuev N., Kononov H., Kopelev L., Korzhavin N., Kozhevnikov P., Krivulin V., Kruchenykh A., Kublanovskij Y., Kushner A., Kuzmin M., Leonov L., Limonov E., Lipkin S., Lipatov V., Lotman Y., Lozinskij M., Lugovskoj V., Lunts L., Makanin V., Makarenko A., Maksimov V., Mandel’shtam O., Marshak S., Mariengof A., Mikhalkov S., Morits Y., Mother Maria, Mayakovskij V., Merezhkovskij D., Mezhirov A., Mejlakh M., Nabokov V., Nagibin Y., Narbut V., Nekrasov V., Nilus S., Nosov N., Novikov-Priboj A., Odoevtseva I., Olesha Y., Okhapkin O., Okudzhava B., Olejnikov N., Oseev N., Ostrovskij N., Panova V., Panteleev L., Pasternak B., Paustovskij K., Petrov Y., Pikul’ V., Pil’nyak B., Petrushevskaya L., Platonov A., Polevoj B., Popov V., Popov Y., Prigov D., Pristavkin A., Prishvin M., Pulatov T., Rasputin B., Radzinskij E., Remizov A., Roshchin M., Rozanov V., Rozhdestvenskij R., Rozov V., Rubtsov N., Rubinshtejn L., Rybakov A., Samojlov D., Sapgir G., Sevela E., Severyanin I., Sel’vinskij I., Semenov Y., Serafimovich A., Shaginyan M., Shalamov V., Shatrov M., Shvarts E., Shefner V., Shinkarev V., Shklovskij V., Shmelev I., Shukshin V., Sholokhov M., Simonov K., Slavkin V., Slutskij B., Sokolov Sasha, Sokolov-Mikitov I., Sologub F., Soloukhin V., Sorokin V., Solzhenitsyn A., Sosnora V., Strugatskij A. & B., Svetlov M., Tarkovskij A., Teffi, Terts A. /Sinyavskij A./, Tikhonov N., Tolstoj A., Tolstoj L., Tolstaya T., Tryapkin N., Tsvetaeva M., Tynyanov Y., Trenev K., Trifonov Y., Tvardovskij A., etc.
    The texts are encoded in HTML and SGML. The corpus is prepared for distribution on CD-ROM as an anthology, compiled by S. Yablonskij, together with several dictionaries and presentations of the authors and their works.
Russicon dictionaries for the Russian language:
The total number of entries in the Russicon dictionaries for the Russian language is about 200,000. All the dictionaries are implemented both as source files and as compressed linguistic databases in a format specific to Russicon NLP software. This format is designed for efficient language processing with low memory requirements.
The source files are available in text, HTML and SGML formats. They contain normalised entry words (lemmas) with hyphenation and inflection paradigms, plus grammatical tagging for each wordform. The set of Russian language tags consists of: part of speech, case, gender, number, tense, person, degree of comparison, voice, aspect, mood, form, type, transitivity, reflexivity, animacy. Thesauri, explanatory dictionaries and reference guides are presented as two or more text files, one of which always contains the inflection paradigms of all words in the dictionary/guide.
  1. Russian Basic Grammatical Dictionary.
    The source file of the dictionary contains approximately 80,000 normalised headwords (lemmas) with hyphenation plus inflection paradigms, and expands to approximately 3.5 million wordforms. Of abbreviations and units of measure, only the most frequent are included in the dictionary. The entry lemmas produce 77,717 word changing stems (WCS – the part of the word without its ending), which are distributed by part of speech as follows:
    Nouns – more than 50%
    Adjectives – about 23%
    Verbs – about 23%
    Adverbs – about 2%
    Other parts of speech (interjections, prepositions, connectives, abbreviations, numerals, conjunctions, parentheses, pronouns, units of measure, modal words) contribute 0.1 to 0.3% each. WCS are counted only for normalised entry words. Participles and verbal adverbs do not form separate word changing stems, because they are verb forms and are generated within the verb paradigm. In the compressed dictionary database, all WCS are divided into word building groups (WBG). A word building group consists of a word building stem (WBS – the part of the word without its ending and suffixes) and a set of triplets S+IC+BBT, where S is the suffix (the number of the suffix, 1 byte), IC is the inflection class (the number of the inflection class, 1 byte) and BBT is 1 byte of binary tags.
    Each triplet unambiguously determines a WCS and its grammatical characteristics. To increase the speed of morphological analysis and normalisation, WCS were generated for all wordforms, which produced 179,289 WCS for the database. The 255 most frequent suffixes were coded; all other suffixes were included in the WBS. As a result, the WCS are distributed over 42,874 WBS in such a way that 12,101 WBS (28.2%) have only one corresponding WCS and the rest have two or more; for example, there are 1,109 WBS with 9 corresponding WCS. The maximum number of WCS per WBS is 49, and the average is 4.18. The average length of a WBS is 7 letters.
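    The grouping described above can be illustrated with a minimal Python sketch. It is not the Russicon database format itself, whose byte layout is not published here. The suffix table, the transliterated stem "vod", the inflection class numbers, the tag bits and the names Triplet, WordBuildingGroup and stems are all hypothetical, chosen only to show how one WBS plus a set of 3-byte S+IC+BBT triplets can yield several WCS.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical suffix table; index 0 stands for the empty suffix.
# The real suffix numbering is not given in the text.
SUFFIXES = ["", "n", "yan"]


@dataclass(frozen=True)
class Triplet:
    suffix_id: int         # S: number of the suffix (1 byte)
    inflection_class: int  # IC: number of the inflection class (1 byte)
    tags: int              # BBT: 1 byte of binary tags

    def pack(self) -> bytes:
        # bytes() rejects values outside 0..255, enforcing the 1-byte limit.
        return bytes([self.suffix_id, self.inflection_class, self.tags])

    @classmethod
    def unpack(cls, raw: bytes) -> "Triplet":
        if len(raw) != 3:
            raise ValueError("a triplet occupies exactly 3 bytes")
        return cls(raw[0], raw[1], raw[2])


@dataclass
class WordBuildingGroup:
    wbs: str  # word building stem: the word without ending and suffixes
    triplets: List[Triplet] = field(default_factory=list)

    def stems(self) -> List[str]:
        # Each triplet determines one WCS: the WBS followed by its suffix.
        return [self.wbs + SUFFIXES[t.suffix_id] for t in self.triplets]


# Illustrative group with made-up class numbers and tag bits.
group = WordBuildingGroup(
    "vod",
    [Triplet(0, 12, 0b00000001), Triplet(1, 40, 0b00000010)],
)
print(group.stems())  # ['vod', 'vodn']
```

    Under these assumptions each WCS costs three bytes on top of the shared stem, which is one plausible reason for grouping stems by WBS rather than storing every WCS as a separate string.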
    The most frequent suffix is the empty suffix, which is used in 26,352 WCS. Other frequent suffixes are each used in more than 5,000 WCS. The size of the compressed dictionary is 990 K.
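    The WCS and WBS counts quoted for the compressed database can be cross-checked with a few lines of arithmetic. The variable names are illustrative. The final estimate assumes exactly one 3-byte triplet per WCS, which follows from the description of the triplets but is not stated as a storage figure in the text.

```python
WCS_TOTAL = 179_289   # WCS generated for the database
WBS_TOTAL = 42_874    # word building stems
WBS_SINGLE = 12_101   # WBS with exactly one corresponding WCS

average_wcs_per_wbs = WCS_TOTAL / WBS_TOTAL
single_share_percent = 100 * WBS_SINGLE / WBS_TOTAL

# Assumption: one 3-byte S+IC+BBT triplet per WCS, stems stored separately.
triplet_bytes = 3 * WCS_TOTAL

print(f"{average_wcs_per_wbs:.2f}")   # 4.18
print(f"{single_share_percent:.1f}")  # 28.2
print(triplet_bytes)                  # 537867
```

    Both derived values agree with the quoted average of 4.18 and share of 28.2%. On the stated assumption, the triplets alone would take roughly half of the 990 K, leaving the rest for stems and other tables.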
  2. Dictionary of Russian given names, patronymics and surnames. 10,000 entry words: Russian given names, diminutives and patronymics, and surnames of world-famous and famous Russian people.
  3. Dictionary of geographical names. 1,500 entries.
  4. Jargon dictionary. 5,000 entries.
  5. Monolingual Russian dictionaries for domains:
Word list with inflection paradigms for normalised headwords (lemmas) plus grammar tags connected with the reference guide, plus reference guide and concise explanatory dictionary of linguistic terms.
  1. Orthographic dictionary.
    List of 60,000 words plus an orthography and punctuation reference guide. A concise explanatory dictionary of linguistic terms is also included.
    Each normalised word is presented with its inflection paradigm plus grammar tags linking it to the reference guide.
  2. Russicon Russian thesaurus.
    8,696 synonymic groups plus a word list containing approximately 30,000 normalised words with hyphenation and inflection paradigms.
  3. Contemporary Russian explanatory and grammatical dictionary.
    This dictionary is conceived as a universal reference book of the Russian language of the end of the XX century. It contains more than 100,000 entries, including new words, idioms and their meanings from the language of the 1980s and the beginning of the 1990s. All dictionary information for entries is structured in more than 60 attributes: entry word; multiple word entries; usage notes; precise, contemporary definitions; derivations; example sentences/citations; idioms, etc. Every word in the dictionary has its entry in the Russicon Basic Grammatical Dictionary, with the inflection paradigm.
Russicon monolingual dictionaries for other Slavonic languages:
  1. Ukrainian grammatical dictionary.
    70,000 normalised entry words with hyphenation and inflection paradigms plus grammatical tags for each wordform of the paradigm, expanding to approximately 2,700,000 wordforms.
  2. Ukrainian thesaurus.
    6,111 synonymic groups plus word list containing 20,000 words.
  3. Ukrainian dictionary of personal names.
    2,000 personal names and diminutives.
  4. Czech grammatical dictionary.
    70,000 words.
  5. Czech thesaurus.
    5,900 synonymic groups plus word list containing 20,000 entries.