
Text Data Mining

Resources for working with text as data, including corpus preparation, tutorials, data sources, and lists of tools.

Congressional Record Text-as-Data Collection

The ProQuest Congressional Record text-as-data collection consists of machine-readable files capturing the full text and a small number of metadata fields for a full run of the Congressional Record between 1789 and 2005. Metadata fields include the date of publication, subjects (for issues for which such information exists in the ProQuest system), and URLs linking the full text to the canonical online record for that issue on the ProQuest Congressional platform. A total of 31,952 issues are available.

The collection is restricted to members of the NYU community.

The data are arranged in JSON format, with each file encompassing a single issue. The files are split into three parts:

  • Part A: years 1789 to 1997
  • Part B: years 1998 to 2001
  • Part C: years 2002 to 2005
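As a quick aid for deciding which part to open, the year ranges above can be written as a small lookup. This is only a sketch: the function name part_for_year is hypothetical, and how the three parts are actually named on disk is not stated here, so check the README before building paths from the returned letter.

```python
# Year ranges for the three parts, as listed in this guide.
PART_RANGES = {
    "A": (1789, 1997),
    "B": (1998, 2001),
    "C": (2002, 2005),
}


def part_for_year(year: int) -> str:
    """Return the part letter (A, B, or C) that covers the given year.

    Hypothetical helper: it only encodes the ranges listed above and makes
    no assumption about directory or file names.
    """
    for part, (first, last) in PART_RANGES.items():
        if first <= year <= last:
            return part
    raise ValueError(f"{year} is outside the collection's 1789-2005 coverage")
```

For example, part_for_year(1999) returns "B", and a year after 2005 raises a ValueError because the collection is not updated with later issues.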

Note that this collection is not updated with the latest issues. For the full text of the Congressional Record including the most recent issues, use ProQuest TDM Studio. When writing to data.services@nyu.edu to request access to TDM Studio, state as a special request that you intend to use the Congressional Record. Please note that setting up access to the Congressional Record in TDM Studio takes longer than for other assets provided in that service.

The PDF files from which the JSON text was extracted, one per issue, are available on request. Contact data.services@nyu.edu to request access.

The data can be accessed through two options:

Research Workspace

NYU researchers with a valid netID can mount the cloud-based ds_collections Research Workspace share on any local computer on the NYU network (i.e. on campus or on NYU VPN if off campus). Follow the instructions for how to access Research Workspace, using ds_collections as the project name. The collection will be found at ds_collections/proquest/proquest_congressional_record. A README file with further information about using the files is available there.

NYU High Performance Computing

The files can be found and used directly for batch jobs on the NYU HPC at /scratch/work/public/proquest/proquest_congressional_record. A README file with further information about using the files is available there. To request an HPC account (faculty sponsorship is required), visit the HPC homepage.
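Before writing a batch job, it can help to look at what the JSON files actually contain. The sketch below counts the top-level keys found in a sample of files. It assumes that each issue is stored as a file ending in .json somewhere under the directory given and that each file holds a single JSON object; neither detail is confirmed here, and the function name summarize_keys is hypothetical. The README in the collection directory is the authoritative description of the layout and field names.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_keys(root, limit=100):
    """Count top-level JSON keys across up to `limit` files under `root`.

    Assumptions (not confirmed by this guide): files end in ".json" and each
    file contains one JSON object describing a single issue.
    Returns (number_of_files_read, Counter mapping key -> file count).
    """
    counts = Counter()
    files_read = 0
    for path in sorted(Path(root).rglob("*.json")):
        if files_read >= limit:
            break
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        # Only objects have keys; skip anything else rather than guess.
        if isinstance(record, dict):
            counts.update(record.keys())
        files_read += 1
    return files_read, counts
```

On the HPC this could be called as summarize_keys("/scratch/work/public/proquest/proquest_congressional_record", limit=50). Keys that appear in only some of the files may correspond to optional metadata such as subjects, which this guide says exist only for some issues.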

Questions?

Contact libraries-tdm@nyu.edu for questions about this data source.