
Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

English-Corpora Files

In addition to the English-Corpora web interface, NYU users have access to the complete raw file archives for computational analysis. The corpora in this package are: iWeb (the Intelligent Web Corpus); NOW (News on the Web); the Coronavirus Corpus; COCA (Corpus of Contemporary American English); GloWbE (Global Web-based English); the Wikipedia Corpus; COHA (Corpus of Historical American English); the TV Corpus; the Movies Corpus; the SOAP Corpus; the Corpus del Español; and the Corpus do Português. More information on the corpora is available at https://www.corpusdata.org/corpora.asp

The data can be accessed and downloaded from Research Workspace. Further documentation about the collection is available on its home record on the UltraViolet research repository at https://doi.org/10.58153/c927w-hjr36.

Research Workspace Access Instructions

NYU researchers with a valid NetID can mount the cloud-based ds_collections Research Workspace share on any local computer connected to the NYU network (on campus, or through the NYU VPN when off campus). Follow the instructions for accessing Research Workspace, using ds_collections as the project name. The English-Corpora collection is located at ds_collections/english-corpora.
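Once the share is mounted, a quick inventory of the collection can help in planning an analysis. The following is a minimal Python sketch, not an official tool. The mount point /Volumes/ds_collections is a hypothetical example (the actual location depends on your operating system and how you mounted the share), and summarize_collection is an illustrative helper name. The sketch makes no assumptions about how the files inside english-corpora are organized; it simply counts the files beneath each top-level entry.

```python
from collections import Counter
from pathlib import Path


def summarize_collection(collection_dir):
    """Count the files beneath each top-level entry of collection_dir."""
    root = Path(collection_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} not found; is the share mounted?")
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Group by the first path component below the collection root.
            # A file sitting directly in the root is counted under its own name.
            top = path.relative_to(root).parts[0]
            counts[top] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical mount point: replace with wherever ds_collections is mounted.
    mount_point = Path("/Volumes/ds_collections")
    collection = mount_point / "english-corpora"
    if collection.is_dir():
        for name, n in sorted(summarize_collection(collection).items()):
            print(f"{name}: {n} files")
    else:
        print(f"{collection} is not available; mount the share first.")
```

Walking a large network share can be slow, so it may be worth running the function on a single corpus subfolder first.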

Questions?

Contact data.services@nyu.edu with questions about this data source.