Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

ProQuest Historical Newspapers Text-as-Data Collection

The ProQuest Historical Newspapers text-as-data collection consists of approximately 99.1 million XML files containing the full text and a number of metadata fields for articles from 26 newspaper titles in the ProQuest Historical Newspapers database. The titles and years of coverage in this collection are:

  • American Hebrew & Jewish Messenger, 1857-1922
  • American Israelite, 1854-1927
  • Atlanta Daily World, 1931-2003
  • Atlanta Constitution, 1868-1933
  • Baltimore Afro-American, 1893-1988
  • Boston Globe, 1872-1985
  • Chicago Tribune, 1849-1933
  • Christian Science Monitor, 1908-1994
  • Guardian and the Observer, 1791-1909
  • Hartford Courant, 1764-1931
  • Irish Times and Weekly Irish Times, 1859-1930
  • Jewish Advocate, 1905-1990
  • Jewish Exponent, 1887-1990
  • Norfolk Journal & Guide, 1921-2003
  • Los Angeles Times, 1881-1933
  • New York Amsterdam News, 1922-1993
  • New York Times with Index, 1851-1937
  • New York Tribune/Herald Tribune, 1841-1962
  • Newsday, 1940-1989
  • Philadelphia Tribune, 1912-2001
  • Pittsburgh Courier, 1911-2002
  • San Francisco Chronicle, 1865-1922
  • Times of India, 1838-2008
  • Wall Street Journal, 1889-1936
  • Washington Post, 1877-1935
  • ProQuest's Chinese Newspaper Collection, 1832-1953

Note that this collection is static: no additional titles or newer issues are being added at this time. For more complete access to the full text of ProQuest's newspaper holdings, consider using ProQuest TDM Studio.

The collection is available to members of the NYU community only.

Each XML file contains the metadata, identifiers, and full text of one article from a single newspaper issue. At both access points below, the files are organized into one directory per newspaper title, and each directory holds multiple compressed archives (.zip format), each containing many XML files. A README file included with the collection gives further details about the file structure.
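
As a rough illustration of working with this layout, the sketch below reads the XML files directly out of one zip archive without unpacking it to disk. It is only a starting point under stated assumptions: the function name iter_articles is hypothetical, and because the actual element names are documented in the README rather than here, the code does not assume any particular tags and simply collects the text of every leaf element it finds.

```python
import zipfile
import xml.etree.ElementTree as ET


def iter_articles(zip_path):
    """Yield (member_name, record) for each XML file in one zip archive.

    `record` maps each leaf element's tag to its text. No specific field
    names are assumed; consult the collection's README for the real schema.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an article file
            with archive.open(name) as handle:
                root = ET.parse(handle).getroot()
            record = {}
            for element in root.iter():
                # Keep only elements with no children and non-empty text.
                if len(element) == 0 and element.text and element.text.strip():
                    # If a tag repeats, keep the first occurrence.
                    record.setdefault(element.tag, element.text.strip())
            yield name, record


# Example use (the archive name here is a placeholder, not a real file name):
# for name, record in iter_articles("some_archive.zip"):
#     print(name, sorted(record))
```

Streaming from the archive avoids unpacking millions of small files, which can matter on shared file systems. A malformed file would raise xml.etree.ElementTree.ParseError, so a production script would likely want to catch and log that.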

There are two access points:

Research Workspace

NYU researchers with a valid netID can mount the cloud-based ds_collections Research Workspace share on any local computer on the NYU network (i.e. on campus or on NYU VPN if off campus). Follow the instructions for how to access Research Workspace, using ds_collections as the project name. The collection will be found at ds_collections/proquest/proquest_hnp.

NYU High Performance Computing

The files can be found and used directly for batch jobs on the NYU HPC at /scratch/work/public/proquest/proquest_hnp. To request an HPC account (faculty sponsorship is required), visit the HPC homepage.
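
Before submitting a batch job, it can help to take a quick inventory of what is available under that path. The sketch below counts zip archives per newspaper directory. It assumes, based on the description above, that the first directory level below the collection root corresponds to a newspaper title; the function name count_zips_by_title is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Collection root on the NYU HPC, as given in this guide.
HNP_ROOT = Path("/scratch/work/public/proquest/proquest_hnp")


def count_zips_by_title(root=HNP_ROOT):
    """Return a Counter mapping each top-level directory to its zip count."""
    root = Path(root)
    counts = Counter()
    for zip_path in root.rglob("*.zip"):
        # Assumption: the first path component below the root is the
        # newspaper title directory.
        title = zip_path.relative_to(root).parts[0]
        counts[title] += 1
    return counts


# Example use on the HPC:
# for title, n in sorted(count_zips_by_title().items()):
#     print(f"{title}: {n} archives")
```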


Contact for questions about this data source.


About the data
You can preview the data quality and availability via the dashboard below. OCR quality was scored per article as the percentage of the article's tokens that appear in a large curated corpus of attested words drawn from the HathiTrust digital library.
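
The scoring idea can be sketched in a few lines. This is not the procedure used to produce the dashboard: the tokenizer (lowercased runs of letters) and the known_words set are assumptions made for illustration, and the real word list derived from HathiTrust is not reproduced here.

```python
import re


def ocr_quality(text, known_words):
    """Return the percentage of tokens in `text` found in `known_words`.

    Tokenization is an assumption: lowercased runs of letters, so digits
    and punctuation are ignored. Returns 0.0 for text with no tokens.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return 100.0 * hits / len(tokens)


# Toy word list standing in for the attested-word corpus.
sample_words = {"the", "quick", "brown", "fox"}
print(ocr_quality("The qnick brown fox", sample_words))  # 3 of 4 tokens match
```

A score like this is a proxy rather than a direct measure of OCR accuracy: correctly recognized proper names or archaic spellings that are missing from the word list will lower it.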

Additional documentation for this collection is available in the guide prepared by ProQuest.