NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
HathiTrust Digital Library is a digital preservation repository and highly functional access platform. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives. Bibliographic and full-text search are available for all volumes in HathiTrust. Public domain volumes are freely accessible to the public and can be downloaded in their entirety, with authentication, by persons affiliated with partner institutions. NYU Libraries is a HathiTrust partner institution.
Documenting the American South (DocSouth) is a digital publishing initiative that provides Internet access to texts, images, and audio files related to southern history, literature, and culture. Currently DocSouth includes sixteen thematic collections of books, diaries, posters, artifacts, letters, oral history interviews, and songs.
BYU Corpora is a set of text corpora for linguistics and machine learning applications. The corpora at this site were created by Mark Davies, Professor of Linguistics at Brigham Young University. The corpora have many different uses, including: finding out how native speakers actually speak and write; finding the frequency of words, phrases, and collocates; looking at language variation and change (for example, across historical periods, dialects, and genres); gaining insight into culture (for example, what is said about different concepts over time and in different countries); and designing authentic language teaching materials and resources. This database provides access to the popular Corpus of Contemporary American English (COCA) and Corpus of Historical American English (COHA), as well as News on the Web (NOW), Global Web-Based English (GloWbE), the Wikipedia Corpus, and many others.
For best results, after accessing this resource, create a free personal account.
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. It was formed in 1992 to address the critical data shortage then facing language technology research and development. LDC creates and distributes a wide array of language resources, including materials used by those engaged in language-related education, research and technology development. Spanning data collections, corpora, software, research papers and specifications, these vital tools aid and inspire scientific progress. Please note that NYU does not currently own all products available from LDC.
To access LDC data, you must be identified as a member of the NYU Community. To get started, go to the account homepage and create an account. Please be sure to enter "New York University" (without the quotes) into the "Organization" box and then select it as an organization. After you make your account, you will be contacted with next steps.
These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets). Each of the numbered links below will directly download a fragment of the corpus. In Version 2 the ngrams are grouped alphabetically (languages with non-Latin scripts were transliterated); in Version 1 the ngrams are partitioned into files of equal size. In addition, for each corpus we provide a file named total_counts, which records the total number of 1-grams contained in the books that make up the corpus. This file is useful for computing the relative frequencies of ngrams.
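As a rough illustration of how total_counts can be used to compute relative frequencies, here is a minimal Python sketch. It rests on assumptions that should be checked against the downloaded files: that each ngram record is a tab-separated line of ngram, year, match count, and volume count, and that total_counts holds tab-separated entries of the form year,match_count,page_count,volume_count. The function names and the sample values are hypothetical and invented for illustration only.

```python
from collections import defaultdict


def parse_ngram_lines(lines):
    """Sum match counts per (ngram, year).

    Assumed record layout (verify against the real files):
    ngram <TAB> year <TAB> match_count <TAB> volume_count
    """
    counts = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, year, match_count, _volume_count = line.split("\t")
        counts[(ngram, int(year))] += int(match_count)
    return dict(counts)


def parse_total_counts(text):
    """Map each year to its total number of 1-grams.

    Assumed layout (verify against the real file): tab-separated
    entries of the form year,match_count,page_count,volume_count.
    """
    totals = {}
    for entry in text.split("\t"):
        entry = entry.strip()
        if not entry:
            continue
        year, match_count, *_rest = entry.split(",")
        totals[int(year)] = int(match_count)
    return totals


def relative_frequency(counts, totals, ngram, year):
    """Return match count divided by the year's total 1-grams (0.0 if unknown)."""
    total = totals.get(year, 0)
    if total == 0:
        return 0.0
    return counts.get((ngram, year), 0) / total


if __name__ == "__main__":
    # Invented sample values, not taken from the real datasets.
    sample_ngrams = ["corpus\t1990\t120\t45\n", "corpus\t1991\t150\t50\n"]
    sample_totals = "1990,1000000,5000,300\t1991,1200000,6000,350\t"

    counts = parse_ngram_lines(sample_ngrams)
    totals = parse_total_counts(sample_totals)
    print(relative_frequency(counts, totals, "corpus", 1990))
```

Dividing by the per-year total, rather than by a single corpus-wide total, keeps frequencies comparable across years in which very different numbers of books were scanned.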
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.