Research Guides: Text Data Mining: English-Corpora Files

English-Corpora Files

In addition to the English-Corpora web interface, NYU users have access to the complete raw file archives for computational analysis. The corpora in this package are: iWeb (the Intelligent Web Corpus); NOW (News on the Web); the Coronavirus Corpus; COCA (Corpus of Contemporary American English); GloWbE (Global Web-based English); the Wikipedia Corpus; COHA (Corpus of Historical American English); the TV Corpus; the Movies Corpus; the SOAP Corpus; the Corpus del Español; and the Corpus do Português. More information about the corpora is available on the English-Corpora website.

The data can be accessed and downloaded from Research Workspace. Further documentation about the collection is available on its home record on the UltraViolet research repository.

Research Workspace Access Instructions

NYU researchers with a valid NetID can mount the cloud-based ds_collections Research Workspace share on any local computer connected to the NYU network (that is, on campus, or through the NYU VPN when off campus). Follow the instructions for accessing Research Workspace, using ds_collections as the project name. The English-Corpora collection is located at ds_collections/english-corpora.
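Once the share is mounted, a quick check that the collection is reachable can be scripted. The following is a minimal Python sketch, not an official NYU tool. The mount point `/Volumes/ds_collections` and the function name `list_corpora` are assumptions made for illustration; the actual mount location depends on your operating system and on how you mounted the share. Only the `english-corpora` folder name comes from the instructions above, and the sketch makes no assumption about what that folder contains.

```python
from pathlib import Path

# Hypothetical mount point (assumption): adjust to wherever the
# ds_collections share appears on your computer.
SHARE_ROOT = Path("/Volumes/ds_collections")


def list_corpora(share_root: Path) -> list[str]:
    """Return the names of the top-level folders under english-corpora."""
    collection = share_root / "english-corpora"
    if not collection.is_dir():
        raise FileNotFoundError(
            f"{collection} not found; check that the share is mounted "
            "and that you are on the NYU network or VPN."
        )
    # Keep only directories, and skip hidden ones.
    return sorted(
        entry.name
        for entry in collection.iterdir()
        if entry.is_dir() and not entry.name.startswith(".")
    )


if __name__ == "__main__":
    try:
        for name in list_corpora(SHARE_ROOT):
            print(name)
    except FileNotFoundError as err:
        print(err)
```

If the share is not mounted, the script prints a reminder instead of a traceback, which makes it easy to tell a connection problem apart from a wrong path.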

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
