
Machines and Society

A growing guide on the latest in data-driven research and emerging technologies at the intersection of society, information and technology.

Introduction

In this section, we explore various approaches to using generative AI as a research assistant for empirical and data-driven research. Generative AI has the potential to support multiple stages of scientific research, including data collection, feature generation, exploratory data analysis, fine-tuning models, and literature synthesis. Any such use should be approached with caution and paired with human evaluation and interpretation.
 
The major incentive for integrating generative AI into the research process is that it enables researchers to concentrate on core tasks by delegating supporting work to the model.

However, when conducting research with the assistance of generative AI, it is crucial to evaluate the model's limitations and the potential variability in its performance. In the end, domain expertise, the ability to generate unique insights, and ethical considerations are what will distinguish research outcomes.

Coding Assistant for Data Analysis

One of the most apparent use cases in research is using generative AI as a coding assistant for data analysis across various stages of the data workflow. For instance, generative AI can help a researcher understand or replicate a study written in an unfamiliar programming language, or even one from a new domain. A minimal sketch of one such prompt appears after the list below.

Researchers can use the tool to

  • explain code in a step-by-step manner
  • translate between programming languages
  • ask "how to code" questions
  • generate sample code
  • optimize code for efficiency, simplicity, readability, and maintainability 
  • debug code
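
As a concrete illustration, several of the items above can be combined in a single programmatic request. The sketch below assumes the openai Python package (v1 client), an OPENAI_API_KEY environment variable, and a placeholder model name; it shows one possible way to ask a chat model to explain and translate a snippet, not a prescribed workflow.

# Ask a chat model to explain an R snippet step by step and translate it to Python.
# Assumes the openai Python package (v1 client) and an OPENAI_API_KEY environment
# variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

snippet = '''
library(dplyr)
flights %>% group_by(carrier) %>% summarise(mean_delay = mean(dep_delay, na.rm = TRUE))
'''

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a coding assistant for data analysis."},
        {"role": "user", "content": "Explain this R code step by step, then translate it to Python with pandas:\n" + snippet},
    ],
)

print(response.choices[0].message.content)

The same requests can be made in the chat interface; the API form simply makes it easier to repeat them over many snippets.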

Caution: 
ChatGPT's performance in these cases depends on the level of support for the particular programming language or software package. Additionally, researchers should be cautious about using generative AI for anything beyond coding assistance. Even tasks as seemingly straightforward as data cleaning require making many small decisions based on clearly defined objectives. Relying on generative AI for more complex tasks, such as model selection, without evaluating the outputs is also not recommended.

Natural Language Processing Tools

Researchers may leverage generative AI as a natural language processing tool for tasks such as labeling topics, extracting entities, and assessing sentiment in text data.

There are experiments using ChatGPT for these kinds of tasks, with potential for feature generation; a minimal sketch appears below.
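
The sketch below shows one way to use a chat model for zero-shot sentiment labeling that could feed a feature column; the openai Python package (v1 client), the model name, and the label set are assumptions for illustration, not drawn from any cited study.

# Zero-shot sentiment labels for a small list of texts, for possible use as features.
# Assumes the openai Python package (v1 client) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

texts = [
    "The new policy was welcomed by most residents.",
    "Service has been unreliable and frustrating this month.",
]

labels = []
for text in texts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": ("Label the sentiment of the following text as positive, negative, "
                        "or neutral. Answer with a single word.\n\n" + text),
        }],
    )
    labels.append(response.choices[0].message.content.strip().lower())

print(labels)  # outputs may vary between runs, which is part of the reproducibility concern noted below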

Caution:
This technique may help researchers obtain a preliminary understanding of their data and documents at the initial stages of research. However, the generation process is not transparent and arguably not reproducible when the full technical details of the model are absent, making it challenging to make informed decisions along the way.

Exploratory Data Analysis

Similar to using generative AI as an NLP tool, one may use generative AI to conduct Exploratory Data Analysis (EDA). For instance, a researcher can ask ChatGPT to read a CSV file and then perform preliminary tasks, such as describing, summarizing, and subsetting the dataset, before creating plots. The current capacity of generative AI may be limited in terms of the size of the files it can process. A minimal sketch of this kind of workflow appears below.
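
The sketch below illustrates the kind of preliminary EDA code a researcher might ask ChatGPT to generate or run; the file name and column names are hypothetical, and the steps are only one possible starting point.

# Preliminary EDA on a hypothetical CSV file with hypothetical column names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_responses.csv")  # hypothetical file

print(df.shape)         # number of rows and columns
print(df.dtypes)        # column types
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column

# Subset the data and plot a group-level summary (columns are hypothetical).
subset = df[df["year"] >= 2020]
subset.groupby("group")["score"].mean().plot(kind="bar")
plt.title("Mean score by group, 2020 onward")
plt.tight_layout()
plt.show()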

Caution:
See the cautions in the Natural Language Processing Tools section above.

Fine-tuning for Downstream Tasks

Essentially, fine-tuning involves uploading one's training data, usually a smaller dataset, and training a new model that is optimized for a specific task or domain. 
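
As a rough illustration of the mechanics, the sketch below prepares chat-formatted training examples and starts a fine-tuning job with the openai Python package (v1 client); the file name, base model name, and example content are assumptions for illustration and are not taken from the study cited below.

# Prepare a small JSONL training file in the chat format and start a fine-tuning job.
# Assumes the openai Python package (v1 client) and an OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

# Each training example is one JSON object per line, in the chat messages format.
examples = [
    {"messages": [
        {"role": "system", "content": "You classify survey responses by topic."},
        {"role": "user", "content": "The clinic wait times were far too long."},
        {"role": "assistant", "content": "healthcare access"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the training file, then create the fine-tuning job on a base model.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)  # the job runs asynchronously; the tuned model name is available once it finishes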

Examples

Kim, J., & Lee, B. (2023). AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction. https://doi.org/10.48550/arXiv.2305.09620

Simulations

Agents

Developing Research

Using generative AI to develop one's research is not restricted to data-driven or empirical research.

This could include asking ChatGPT to generate ideas for a research question and leveraging its summarization ability to synthesize literature. To improve ChatGPT's results, researchers can formulate follow-up questions that expand on key points, correct inaccuracies in ChatGPT's responses, refine previous outputs, and verify parts of the answers they receive.

As an early example, the paper referenced below tested ChatGPT's performance in four stages of the research process, each with iterations: (1) idea generation; (2) literature review; (3) identification of suitable data for the research idea and preparation of a data summary; and (4) creation of a testing framework based on the research idea and the proposed data. Read more about this process in the Appendices (PDF) of the paper below.

Dowling, M., & Lucey, B. (2023). ChatGPT for (Finance) Research: The Bananarama Conjecture. Finance Research Letters, 53, 103662. https://doi.org/10.1016/j.frl.2023.103662

Contact

Yun Dai
Data Services
yun.dai@nyu.edu