Wesbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups, 2005-2010 (40 GB)
Wesbury Lab Wikipedia Corpus: snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).
: a corpus of manually-constructed explanation graphs, explanatory role ratings, and an associated semi-structured tablestore for publicly available elementary science exam questions in the US (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)
Wikipedia XML data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML (500 GB)
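Because the full XML dump is hundreds of gigabytes, it is usually processed as a stream rather than loaded whole. A minimal sketch using Python's standard-library `iterparse` is below; the inline sample document and its `export-0.10` namespace are stand-ins for a real dump file, not part of the corpus itself.

```python
# Sketch: streaming-parse a MediaWiki XML export without loading the whole
# dump into memory, using the stdlib ElementTree iterparse API.
# SAMPLE is a tiny hypothetical stand-in for a real multi-GB dump file.
import io
import xml.etree.ElementTree as ET

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alan Turing</title>
    <revision><text>'''Alan Turing''' was a mathematician.</text></revision>
  </page>
  <page>
    <title>Ada Lovelace</title>
    <revision><text>'''Ada Lovelace''' wrote the first program.</text></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Yield (title, wikitext) pairs, freeing each <page> element after use."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text")
            yield title, text
            elem.clear()  # discard the subtree so memory stays flat

pages = list(iter_pages(io.StringIO(SAMPLE)))
print(pages[0][0])  # -> Alan Turing
```

For a real dump, `io.StringIO(SAMPLE)` would be replaced with an open file handle (the dumps are distributed bzip2-compressed, so `bz2.open(path)` can be streamed directly). Calling `elem.clear()` after each page is what keeps memory usage constant regardless of dump size.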