Skip to Main Content

Machines and Society

A growing guide on the latest in data-driven research and emerging technologies at the intersection of society, information and technology.

Introduction

In this section, we explore various approaches of utilizing ChatGPT as a research assistant for empirical and data-driven research. The tool has the potential to enhance academic performance in multiple stages of scientific research, including data collection, feature generation, exploratory data analysis, fine-tuning models, literature synthesis, and finding data sources. This can be done with caution in conjunction with human analysis and interpretation. 
 
The major incentive for integrating this tool into the research process is that it enables researchers to concentrate on the fundamental tasks by delegating supplementary responsibilities to ChatGPT. 

However, when conducting research with the assistance of ChatGPT, it is crucial to evaluate the model's limitations and potential variability in performance. In the end, it is domain expertise, the ability to generate unique insights, and ethical considerations that will be essential in distinguishing research outcomes.

Coding Assistant for Data Analysis

ChatGPT's most apparent use case in research is serving as a coding assistant for data analysis across various stages in the data workflow. One application is using ChatGPT to aid the researcher in understanding or replicating a study conducted in unfamiliar or new programming languages and/or even in a new domain.

Researchers can use the tool to

  • explain code in a step-by-step manner
  • translate between programming languages
  • ask "how to code" questions
  • generate sample code
  • optimize code for efficiency, simplicity, readability, and maintainability 
  • debug code

Caution: 
ChatGPT's performance in these cases depends on the level of support for the particular language or software package. According to OpenAI's code completion guide based on the Codex model, it is most capable in Python and proficient in languages including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and Shell. The guide is now marked as "deprecated", but the idea of how it works is still relevant.  

Additionally, researchers should be cautious about using ChatGPT for anything beyond coding assistance. Even tasks as seemingly straightforward like data cleaning require making many small decisions based on clearly defined objectives. Relying on ChatGPT for more complex tasks like model selection without evaluating the outputs is also not recommended.

Natural Language Processing Tool for Feature Generation

Researchers may leverage ChatGPT as a natural language processing tool to generate features for given text data, such as entities and sentiments. 

There are some experiments using ChatGPT for tasks with potential for feature generation, including 

Caution:
This technique may be helpful for researchers to obtain preliminary understanding of their data and documents at the initial stages of research. However, the feature generation process is not transparent when the full technical details of the model (e.g., data collection and cleaning) are absent, making it challenging to make informed decisions along the way. 

Exploratory Data Analysis

Similar to using ChatGPT as an NLP tool, one may use ChatGPT for conducting Exploratory Data Analysis (EDA). For instance, asking ChatGPT to read CSV files and then perform some preliminary tasks, such as describing, summarizing, and subsetting the dataset, before creating some plots. But the capacity of ChatGPT seems limited at the moment, particularly in terms of the size of files it can process.

Caution:
See the notes in NLP Tool for Feature Generation. 

Fine-tuning for Downstream Tasks

GPT-3 models davinci, curie, babbage, and ada can be fine-tuned for downstream tasks. Most relevant for data-driven research is text classification, as illustrated in OpenAI's Guide: best practices for fine-tuning GPT-3 to classify text (Google Doc). 

Sample Python code is available in OpenAI Cookbook, but access to the models requires an OpenAI account and an associated API key.

Essentially, fine-tuning involves uploading one's training data, usually a smaller dataset, and training a new model that is optimized for a specific task or domain. 

Developing Research

Using ChatGPT to develop one's research is not restricted to data-driven or empirical research. This includes asking ChatGPT to generate ideas for a research question, and leveraging its reading summarization ability to synthesize literature. 

To improve ChatGPT's results, researchers can formulate follow-up questions that expand on key points, correct any inaccuracies in ChatGPT's responses, refine previous outputs, and verify certain parts of the answers they receive.

As an example, the paper referenced below tested ChatGPT's performance in four stages of research process, each with iterations: (1) idea generation; (2) literature review; (3) identification of suitable data for the research idea and preparing a data summary; and (4) creation of a testing framework based on the research idea and the proposed data. Read more about this process in the Appendices (PDF) of the paper below.

Dowling, M., & Lucey, B. (2023). ChatGPT for (Finance) Research: The Bananarama Conjecture. Finance Research Letters, 53, 103662. https://doi.org/10.1016/j.frl.2023.103662

Contact

Yun Dai
Data Services
yun.dai@nyu.edu