The website http://corpus.leeds.ac.uk/ was originally designed to host comparable English and Russian corpora, but over time we have accumulated a variety of large corpora served by a uniform search interface, "Leeds CQP", which is a CGI Perl frontend to the IMS Corpus Workbench. Tools developed to work with corpora are listed on a separate page.
Monolingual corpora
English
The English Internet Corpus, a corpus of about 110 million words. This corpus was compiled automatically from the Internet in 2005 along with other Internet corpora (for Chinese, French, German, Italian, Spanish, Polish and Russian).
The Reuters corpus, a collection of newswires from Reuters covering one year, from 1996-08-20 to 1997-08-19, 90 million words.
A corpus of British news, a collection of news stories from 2004 from each of the four major British newspapers: Guardian/Observer, Independent, Telegraph and Times, 200 million words.
Russian
The Russian National Corpus, a collection of texts comparable to the BNC in its design. Its pilot version has 50 million words (a more detailed description of the project is available in Russian from http://ruscorpora.ru).
The Russian Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
A corpus of Russian newspapers, 78 million words (Izvestia, Trud and Strana.ru).
The Russian Standard, a corpus of modern Russian fiction with manual disambiguation of morphological categories, 1.6 million words.
Chinese
The Chinese Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
Few large general corpora of the size of the BNC (100 million words) are available. Within the Wacky (Web as Corpus) project we developed a set of procedures for collecting corpora from the Internet and used them to compile large representative corpora for Chinese, French, German, Italian, Spanish, Polish and Russian. The search interface is available from http://corpus.leeds.ac.uk/internet.html.
The query interface to all corpora is powered by the IMS Corpus Workbench, but it has been extended to simplify some frequent cases, in particular querying for lemmas and for exact word forms (all corpora have word, pos and lemma attributes, even if the latter is redundant for Chinese). Other possibilities include calculating the most significant collocations (using MI, T and log-likelihood scores) and searching for similar contexts in the English, German and Russian corpora.
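The three collocation scores mentioned above can be illustrated with their common textbook definitions. The sketch below is not taken from the Leeds CQP code: the function names are hypothetical, and it assumes the simplest setting, in which the node and the collocate are counted as adjacent pairs in a corpus of n tokens, with no adjustment for window size.

```python
import math

def expected(f_node, f_coll, n):
    """Co-occurrence frequency expected if the two words were independent."""
    return f_node * f_coll / n

def mi_score(f_pair, f_node, f_coll, n):
    """Mutual information: log2 of observed over expected frequency."""
    return math.log2(f_pair / expected(f_node, f_coll, n))

def t_score(f_pair, f_node, f_coll, n):
    """T-score: (observed - expected) / sqrt(observed)."""
    return (f_pair - expected(f_node, f_coll, n)) / math.sqrt(f_pair)

def log_likelihood(f_pair, f_node, f_coll, n):
    """Log-likelihood ratio over the 2x2 contingency table of the pair."""
    observed = [
        f_pair,                        # node with collocate
        f_node - f_pair,               # node without collocate
        f_coll - f_pair,               # collocate without node
        n - f_node - f_coll + f_pair,  # neither
    ]
    rows = (f_node, n - f_node)
    cols = (f_coll, n - f_coll)
    expected_cells = [r * c / n for r in rows for c in cols]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected_cells) if o > 0)

# Made-up counts: a pair seen 100 times, each word seen 1000 times,
# in a corpus of one million tokens.
print(mi_score(100, 1000, 1000, 1_000_000))
print(t_score(100, 1000, 1000, 1_000_000))
print(log_likelihood(100, 1000, 1000, 1_000_000))
```

MI tends to favour rare but exclusive pairs, while the T-score and log-likelihood give more weight to frequent pairs, which is one reason an interface might offer all three.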
The interface was developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries.
Frequencies of personal names from the newspaper corpus. Note that the lemmatisation of names produced by mystem is consistently wrong: either female forms are produced for men (e.g. Putina, Xodorkovskaja), or names are lemmatised as verbs or adjectives, especially non-Slavonic names (Saddam Husejnyj, Garry Pottiratq). Take this into account when making queries.
The structure of the lists follows the template of the lemmatised BNC lists produced by Adam Kilgarriff, namely:
[word rank] [normalised frequency] [lemma, word form or POS]
Note that the frequency has been normalised to ipm (instances per million): the number of instances of an individual word or POS tag per million words in the respective corpus. Normalisation makes it possible to compare frequencies in the BNC against the Internet corpus. If you want to know the actual number of occurrences of a word listed there, multiply the frequency by the corpus size in million words (the size of a corpus is shown at the top of its frequency list). For instance, browser is used about 8556 times in the English Internet Corpus (47.17*181.376).
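The conversion from ipm back to raw counts can be sketched in a few lines of Python. The sketch assumes that the three fields of a list entry are separated by whitespace; the function names and the rank 1500 in the sample line are hypothetical, while the frequency and corpus size are the ones from the browser example above.

```python
def parse_entry(line):
    """Split a list entry into (rank, ipm, item), assuming whitespace-separated fields."""
    rank, ipm, item = line.split(maxsplit=2)
    return int(rank), float(ipm), item

def ipm_to_count(ipm, corpus_size_millions):
    """Convert a frequency in instances per million words to an approximate raw count."""
    return ipm * corpus_size_millions

# The rank is invented for illustration; 47.17 ipm and 181.376 million
# words come from the worked example in the text.
rank, ipm, item = parse_entry("1500 47.17 browser")
print(item, round(ipm_to_count(ipm, 181.376)))  # browser 8556
```

Because the listed frequencies are rounded, the recovered count is an approximation of the true number of occurrences.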