Research Guides: Text Data Mining: Avocado Research Email Collection

Avocado Research Email Collection

The Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada". You must agree to the terms of use before gaining access to the data.

The collection consists of the processed personal folders of these accounts with metadata describing folder structure, email characteristics and contacts, among others. It is expected to be useful for social network analysis, e-discovery and related fields. See the public README for more information.

The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection, which leaves the 279 accounts described above. Each account is referred to as a "custodian", although some of the accounts do not correspond to humans.

The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian.
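The layout described above can be explored with a few lines of Python. What follows is only a minimal sketch under stated assumptions, not the collection's official tooling: it assumes the "directives" in the top-level file are XInclude elements whose href values are relative to that file, and that the extracted text is UTF-8 like the metadata. The function names load_metadata_tree and iter_item_texts are hypothetical, and no element or file names from the real collection are assumed; check the README for the actual schema.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.etree import ElementInclude


def load_metadata_tree(top_level_xml):
    """Parse the top-level XML file and expand XInclude directives in place.

    Assumption: per-custodian files are pulled in with xi:include elements
    whose href is relative to the top-level file's directory.
    """
    top_level_xml = Path(top_level_xml)

    def loader(href, parse, encoding=None):
        # Resolve each included file next to the top-level file.
        target = top_level_xml.parent / href
        if parse == "xml":
            return ET.parse(target).getroot()
        return target.read_text(encoding=encoding or "utf-8")

    root = ET.parse(top_level_xml).getroot()
    ElementInclude.include(root, loader=loader)
    return root


def iter_item_texts(custodian_zip):
    """Yield (member_name, text) for every file in one custodian's zip."""
    with zipfile.ZipFile(custodian_zip) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Assumption: text files are UTF-8; undecodable bytes are replaced.
            text = archive.read(info).decode("utf-8", errors="replace")
            yield info.filename, text
```

Reading items lazily from each zip avoids unpacking every custodian's archive to disk, which may matter if storage for the restricted data is limited.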

Access Instructions

Access Restrictions

Different folks at NYU have different access to the AREC data. Refer to the chart below to understand what you have access to!

Table of permissions for AREC data access at NYU.
Primary status at NYU	Full raw data
Undergraduate student	❌
Undergraduate student - completing a capstone or thesis	✅
Graduate student	✅
Staff	✅
Faculty	✅
Visiting scholar/scholar-in-residence	✅
Non-NYU collaborator	❌

Access Instructions

To get access to the AREC data, please follow these steps:

1. Write a data management plan using our template.
2. Agree to the terms of use for AREC data.
3. Wait for a reply from a member of Data Services with further instructions on how to access the data.

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
