Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)
Westbury Lab Wikipedia Corpus: snapshot of the articles in the English part of Wikipedia, taken in April 2010. It was processed to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).
: a corpus of manually constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for some publicly available elementary science exam questions in the US (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)
Wikipedia XML Data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)
Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)
Yahoo! Answers consisting of questions asked in French: subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)
Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from a 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)
Yahoo! HTML Forms Extracted from Publicly Available Webpages: a small sample of pages that contain complex HTML forms; contains 2.67 million complex forms. (50+ GB)
Yahoo N-Gram Representations: this dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and phrase similarity tasks, which are common in NLP research. (2.6 GB)
Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented sites (12 GB)
Yahoo! Search Logs with Relevance Judgments: anonymized Yahoo! Search Logs with Relevance Judgments (1.3 GB)
Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. 1,490,688 entries. (6 GB)
Yelp: including restaurant ratings and 2.2M reviews (on request)
YouTube: information on 1.7 million YouTube videos (torrent)
- Awesome public datasets/NLP (includes more lists)
- AWS Public Datasets
- CrowdFlower: Data for Everyone (lots of small surveys they conducted and data obtained by crowdsourcing for a specific task)
- Kaggle 1, 2 (make sure, though, that the Kaggle competition data can be used outside of the competition!)
- Open Library
- Quora (primarily annotated corpora)
- /r/datasets (endless list of datasets; most are scraped by amateurs, though, and not properly documented or licensed)
- Rs.io (another big list)
- Stackexchange: Opendata
- Stanford NLP group (mainly annotated corpora and TreeBanks or actual NLP tools)
- Yahoo! Webscope (also includes papers that use the data provided)
- SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)
- Collection of Urdu Datasets for POS, NER and NLP tasks.
German Political Speeches Corpus: collection of recent speeches held by top German representatives (25 MB, 11 MTokens)
NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free to all universities and non-profit organizations. You need to sign and send a form to obtain it. (on request)
Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)
100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)