In this section, we explore ways of using generative AI as a research assistant or research method in data-driven research. Generative AI has the potential to enhance multiple stages of scientific research, provided it is used with caution and in conjunction with human evaluation and interpretation.
The major incentive for integrating generative AI into the research process is that it lets researchers concentrate on core tasks by delegating supplementary work to the model.
However, when conducting research with generative AI, it is crucial to evaluate the model's limitations and potential variability in performance. In the end, it is domain expertise, the ability to generate unique insights, and ethical considerations that will be essential in distinguishing research outcomes.
One of the most apparent use cases is using generative AI as a coding assistant across various stages of the data workflow. Researchers can use the tool to write, debug, explain, and document code, or to translate code between languages and packages.
Note that generative AI's performance in these cases depends on the level of support for the particular language or software package. Additionally, researchers should be cautious about using generative AI for anything beyond coding assistance. Even tasks as seemingly straightforward as data cleaning require making many small decisions based on clearly defined objectives, and relying on generative AI for more complex tasks such as model selection without evaluating the outputs is not recommended.
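As a minimal sketch of what coding assistance can look like in practice, the snippet below asks a model to draft a pandas data-cleaning helper that the researcher then reviews before use. It assumes access to the OpenAI Python SDK and an API key; the model name, prompt, and column names are illustrative placeholders, not a prescribed workflow.

```python
# Minimal sketch: asking an LLM to draft a data-cleaning helper.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Write a pandas function that takes a DataFrame, strips whitespace from "
    "string columns, parses the 'date' column as datetime, and drops rows "
    "where 'respondent_id' is missing. Include a docstring."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

# The generated code is only a draft: read it, test it on a sample of the data,
# and confirm that each cleaning decision matches your own objectives.
print(response.choices[0].message.content)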
Researchers may also leverage generative AI as a natural language processing tool for tasks such as labeling topics, extracting entities, and assessing sentiment in text data (a minimal example is sketched below).
There are some experiments using ChatGPT for tasks with potential for feature generation, including
Note that this technique may help researchers obtain a preliminary understanding of their data and documents at the initial stages of research. However, the generation process is not transparent and is arguably not reproducible when the full technical details of the model are unavailable, making it challenging to make informed decisions along the way.
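To make the labeling use case concrete, here is a minimal sketch of zero-shot sentiment labeling with an LLM. The model name, label set, and example texts are illustrative, and outputs should always be spot-checked against human coding before use.

```python
# Minimal sketch: zero-shot sentiment labeling with an LLM.
# Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY;
# the model name, label set, and example texts are illustrative.
from openai import OpenAI

client = OpenAI()

texts = [
    "The new policy made the application process much easier.",
    "I waited three hours and nobody could answer my question.",
]

labels = []
for text in texts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "user",
                "content": (
                    "Label the sentiment of the following text as exactly one of "
                    "'positive', 'negative', or 'neutral'. Reply with the label only.\n\n"
                    f"Text: {text}"
                ),
            }
        ],
        temperature=0,  # reduce run-to-run variation, though results may still differ
    )
    labels.append(response.choices[0].message.content.strip().lower())

# Validate a sample of labels against human annotation before using them
# as features or measurements in downstream analysis.
print(list(zip(texts, labels)))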
By virtue of their training on vast amounts of human-generated data, LLMs are implicit computational models of humans (often referred to as homo silicus). This makes them well suited to building autonomous agents that simulate individuals and societies in single-agent setups, multi-agent systems, or human-AI interactions.
LLM agents are widely used in social simulations. They are constructed to explore social dynamics, develop or test theories of human behavior, or populate virtual spaces with realistic social phenomena. They provide ethical, scalable alternatives to real-world human studies, including for topics that are difficult to examine or populations that are difficult to access. Common application areas include:
Economics
Social computing
Social theories
LLM agents use modular components to enhance human-like behavior in dynamic settings: a profiling module to define agent roles, a memory module to recall past behaviors, a planning module to decide on future actions, and an action module to translate the agent's decisions into specific outputs.
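The sketch below illustrates this modular structure in plain Python. The class, method names, and prompt wording are hypothetical, and call_llm is a stand-in for any chat-completion API; the point is how a profile, a memory, a planning step, and an action step compose into one agent loop.

```python
# Minimal sketch of a modular LLM agent: profile, memory, planning, action.
# The structure is illustrative; call_llm stands in for a real chat-completion call.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a chat-completion request)."""
    return "(model-generated decision would appear here)"

@dataclass
class Agent:
    profile: str                                  # profiling module: who the agent is
    memory: list = field(default_factory=list)    # memory module: past observations and behaviors

    def plan(self, observation: str) -> str:
        """Planning module: decide what to do next given the profile and recent memory."""
        context = "\n".join(self.memory[-5:])     # recall only the most recent memories
        prompt = (
            f"You are {self.profile}.\n"
            f"Recent memory:\n{context}\n"
            f"New observation: {observation}\n"
            "Describe, in one sentence, what you do next."
        )
        return call_llm(prompt)

    def act(self, observation: str) -> str:
        """Action module: turn the plan into an output and store it in memory."""
        decision = self.plan(observation)
        self.memory.append(f"Observed: {observation} | Did: {decision}")
        return decision

# Usage: agents with different profiles can be placed in a shared loop to simulate
# interactions, with each agent's output becoming the others' observations.
agent = Agent(profile="a 35-year-old teacher deciding how to vote in a local election")
print(agent.act("A neighbor asks your opinion on the school budget proposal."))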
Evaluation strategies for LLM agents include subjective assessments by human judges, who score or rank agent outputs or try to differentiate them from human outputs, as well as quantitative metrics and standardized benchmarks.
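As one small example of a quantitative check, the snippet below computes how often human judges correctly identify which response in a pair was written by the agent. The judgment values are fabricated placeholders for illustration only; an identification rate near chance suggests the agent outputs were hard to distinguish from human ones.

```python
# Minimal sketch of one quantitative evaluation: how often human judges correctly
# identify the agent-written response in an agent-vs-human pair.
# The judgments below are fabricated placeholders for illustration only.

judgments = [  # True = judge correctly identified the agent-written response
    True, False, True, True, False, False, True, False, False, False,
]

accuracy = sum(judgments) / len(judgments)
print(f"Judge identification accuracy: {accuracy:.2f}")
# Accuracy near 0.5 (chance) means agent outputs were hard to tell apart from
# human outputs; accuracy near 1.0 means they were easy to spot.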
Deploying LLM agents for social simulations requires careful consideration, as model limitations may affect the accuracy of outputs or lead to unintended consequences.
Bias: LLMs tend to give responses that are not representative of the diverse public (see Bias section for more details).
Alignment: LLMs are usually fine-tuned to align with human values, and as a byproduct of fine-tuning, some models tend to be overly agreeable. However, a realistic social simulation of real-world problems may require representing negative human behaviors, which aligned models often restrict.
Low variance of response distributions: LLMs generate less diverse responses than humans would; a simple way to probe this is sketched after this list.
Temporal gaps: The temporal information in LLM training data (e.g., from the internet) is often lost, making it risky to simulate historical contexts or current populations accurately if there's a gap between the model's training data cutoff and the period being modeled.
Cross-linguistic influence: If a model has been trained on a mixture of languages, knowledge and attitudes from one socio-linguistic system may affect others in the model. For instance, the internal representations can be partially language-agnostic, partially biased toward English-centric reasoning, and partially differentiated by language.
Lack of sensory experience: LLMs lack embodied experiences, limiting their understanding of real-world context.
Alien cognition: LLMs may at times deviate from natural human behavior, generating misleading human simulations. For instance, surprises that emerge from analysis may be misconstrued as discoveries when they are mere errors in simulation.
Knowledge boundary: LLMs' vast knowledge can be disadvantageous when simulating scenarios requiring agents to operate with limited or specific knowledge, as they might make decisions based on information real users wouldn't have.
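To probe the low-variance limitation noted above, one option is to sample the same prompt many times and examine how concentrated the responses are. The sketch below assumes the OpenAI Python SDK; the survey-style prompt, model name, and sample size are illustrative.

```python
# Minimal sketch: probing response diversity by repeatedly sampling one prompt.
# Assumes the OpenAI Python SDK and an API key in OPENAI_API_KEY;
# the survey-style prompt, model name, and sample size are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()

prompt = (
    "You are a randomly selected adult respondent. In one word, "
    "what is the most important problem facing the country today?"
)

answers = []
for _ in range(30):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    answers.append(response.choices[0].message.content.strip().lower())

# A human sample would typically spread across many answers; a heavily
# concentrated distribution here signals the low-variance limitation.
print(Counter(answers).most_common())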
Yun Dai
Data Services
yun.dai@nyu.edu