Research Guides: Research Data Management: DATA PROCESSING

DATA PROCESSING

Whatever type of data you work with, you will very likely have to transform it into a different format before you can run your analysis or build your visualization. We often can’t control the form in which data arrives, whether it comes from a machine, a public repository, or a carefully hand-crafted survey.

The first thing you should know when starting to process data is that raw data should be read-only. This means that you should never edit the original file: make a copy, apply your changes to that derivative file, and keep the original untouched.
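One way to put the read-only rule into practice is sketched below in Python. It is only an illustration: the function name `make_working_copy`, the `raw` and `derived` folder names, and the `_processed` suffix are all assumptions made for this example, not a prescribed layout.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def make_working_copy(raw_path: Path, derived_dir: Path) -> Path:
    """Mark the raw file read-only and return the path of a writable copy."""
    # Clear the write bits for owner, group, and others on the raw file.
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    raw_path.chmod(raw_path.stat().st_mode & ~write_bits)

    derived_dir.mkdir(parents=True, exist_ok=True)
    working_path = derived_dir / f"{raw_path.stem}_processed{raw_path.suffix}"
    # copyfile copies contents only, so the copy does not inherit the read-only mode.
    shutil.copyfile(raw_path, working_path)
    return working_path


# Demo in a throwaway directory so no real data is touched.
project = Path(tempfile.mkdtemp())
raw_file = project / "raw" / "survey.csv"
raw_file.parent.mkdir()
raw_file.write_text("respondent_id,age\n1,34\n")

working_file = make_working_copy(raw_file, project / "derived")
# All edits go to the derivative file; the raw file stays as received.
working_file.write_text(working_file.read_text() + "2,41\n")
```

Keeping raw and derived files in separate folders also makes it obvious at a glance which files are safe to change.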

The point of data processing is to make sure that our data becomes:

Consistent: your data shouldn’t contradict itself – for example, two tables that list the same survey respondent with different mailing addresses. Ideally, both tables hold the one correct mailing address. So we process data to make sure it is consistent.
Uniform: all the values in the dataset use the same units and the same representation, e.g. age is expressed as whole numbers (15, 22, 47), addresses are expressed as latitude and longitude, and weight is in kg rather than pounds.
Valid: the values in your dataset fall within the specified range of responses, e.g. if you survey people between the ages of 20 and 60, you shouldn’t see anyone with an age of 18 or 61 in your dataset.
(Possibly) Unique: make sure that if you can’t have a repeating value in one column (like SSN), then it doesn’t actually repeat (you’d be surprised).
Complete: you have recorded all possible knowns and made provisions for the unknowns (e.g. designating a null value as NaN or 999).
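
Several of the properties above can be checked automatically. The Python sketch below is a minimal illustration, assuming survey rows are held as dictionaries; the field names (`respondent_id`, `age`, `address`), the function names, and the sample rows are all made up for this example.

```python
def find_problems(rows, id_field="respondent_id", age_range=(20, 60)):
    """Return (row_index, message) pairs for rows that fail a check."""
    problems = []
    seen_ids = set()
    low, high = age_range
    for index, row in enumerate(rows):
        respondent = row[id_field]
        if respondent in seen_ids:  # Unique: ids must not repeat
            problems.append((index, "duplicate id"))
        seen_ids.add(respondent)

        age = row.get("age")
        if age is None:  # Complete: unknowns should be recorded explicitly
            problems.append((index, "missing age"))
        elif not isinstance(age, int):  # Uniform: ages are whole numbers
            problems.append((index, "age is not a whole number"))
        elif not low <= age <= high:  # Valid: age within the surveyed range
            problems.append((index, "age out of range"))
    return problems


def mismatched_ids(table_a, table_b, id_field, field):
    """Consistent: ids present in both tables whose `field` values differ."""
    b_values = {row[id_field]: row[field] for row in table_b}
    return [
        row[id_field]
        for row in table_a
        if row[id_field] in b_values and b_values[row[id_field]] != row[field]
    ]


# Made-up sample rows, one for each kind of problem.
survey = [
    {"respondent_id": 1, "age": 34},
    {"respondent_id": 2, "age": 18},
    {"respondent_id": 2, "age": 41},
    {"respondent_id": 3, "age": None},
    {"respondent_id": 4, "age": "47"},
]
contacts_a = [{"respondent_id": 1, "address": "12 Howard Street"}]
contacts_b = [{"respondent_id": 1, "address": "98 Main Street"}]

problems = find_problems(survey)
conflicts = mismatched_ids(contacts_a, contacts_b, "respondent_id", "address")
```

A report like `problems` only tells you where to look; deciding which value is the right one is still up to you.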

Some common ways to process data include:

Parsing data to look for errors like “HoWARD STREET”, “Howard Street”, “Howard St”, and making them consistent.
Transforming data – such as converting weight values expressed in pounds into kg.
Deduping data – finding duplicate values or rows that should be merged, and merging them.
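
The three steps above might look something like the following Python sketch. It is a simplified illustration: the helper names, the suffix table, and the sample rows are assumptions made for this example, and here "merging" duplicates simply means keeping the first copy of each identical row.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)
STREET_SUFFIXES = {"st": "Street", "st.": "Street", "street": "Street"}


def normalize_street(name):
    """Parsing: fix capitalization and expand a trailing street abbreviation."""
    words = [word.capitalize() for word in name.split()]
    # Only the last word is treated as a suffix, so a leading "St" is left alone.
    if words and words[-1].lower() in STREET_SUFFIXES:
        words[-1] = STREET_SUFFIXES[words[-1].lower()]
    return " ".join(words)


def to_kg(value, unit):
    """Transforming: express a weight in kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")


def dedupe(rows, key_fields):
    """Deduping: keep the first row seen for each combination of key fields."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Made-up sample rows for illustration.
raw_rows = [
    {"name": "A. Lee", "street": "HoWARD STREET", "weight": 150, "unit": "lb"},
    {"name": "A. Lee", "street": "Howard St", "weight": 150, "unit": "lb"},
    {"name": "B. Cruz", "street": "Howard Street", "weight": 70, "unit": "kg"},
]
cleaned = [
    {
        "name": row["name"],
        "street": normalize_street(row["street"]),
        "weight_kg": round(to_kg(row["weight"], row["unit"]), 1),
    }
    for row in raw_rows
]
deduped = dedupe(cleaned, ["name", "street", "weight_kg"])
```

Notice that the two "A. Lee" rows only become recognizable as duplicates after the street name has been made consistent, which is why the order of the steps matters.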

QUALITATIVE DATA

There are no equally hard and fast rules for qualitative data preparation. It depends heavily on your field and on your method or framework. Some qualitative researchers leave out “um”, “ah”, “well…………” when they transcribe interviews. Others leave them in as a valuable part of the interview, because they reveal the sentiment and nature of the participant. In the case of OCR-ing images or PDFs into machine-readable text, it’s about checking the accuracy of the scan – is the long ‘s’ (ſ), which looks like an ‘f’, transformed into the modern character? Do you want it to be? These questions around data transformations are embedded in the framework under which qualitative researchers operate. So carefully consider your methodology while processing your qualitative data!

QUANTITATIVE & TIDY DATA

If you work with numerical data (especially in tabular formats like CSV or XLSX), you might also hear the phrase ‘tidy’ data. Hadley Wickham, who coined the phrase, describes the five most common problems that make tidying necessary in his paper, Tidy Data:

Column headers are values, not variable names; a table with columns ‘blue’, ‘brown’, and ‘green’ meant to represent eye color, where the rows contain either 0 (no) or 1 (yes). This should really be one column, “eye color”, with the color as the value.
Multiple variables are stored in one column; a table with age and sex combined in one column (e.g. m014, m1524). This should be two columns: sex and age.
Variables are stored in both rows and columns; a table with the column “measurement” that contains data like n_dogs, n_cats with a column ‘value’ next to it depicting the number of cats and dogs per person. This should really be at least two columns: n_dogs, n_cats, with numeric values showing the number of dogs and cats per person.
Multiple types of observational units are stored in the same table; a table that contains someone’s age and height alongside their pet’s name, type of pet, and pet’s age. While it’s cute, these should be two tables: one for people and one for pets.
A single observational unit is stored in multiple tables; a new table of an individual’s medical history for each year of their life. This should really be one table with each row representing a year of someone’s life.
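
Two of the fixes above are easy to sketch in plain Python. Treat this as an illustration under stated assumptions: the function names are made up, the code format (one letter for sex followed by three or four digits for the age range) is inferred from the m014 and m1524 examples, the ‘f’ code is assumed, and the pet counts are invented.

```python
def split_sex_age(code):
    """Split a combined code such as 'm1524' into ('m', '15-24')."""
    sex, digits = code[:1], code[1:]
    if sex not in ("m", "f") or not digits.isdigit() or len(digits) not in (3, 4):
        raise ValueError(f"unexpected code: {code!r}")
    # Assumption: the last two digits are the upper age, the rest the lower age.
    return sex, f"{int(digits[:-2])}-{int(digits[-2:])}"


def spread(rows, id_field, name_field, value_field):
    """Turn measurement/value row pairs into one row per id with named columns."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_field], {id_field: row[id_field]})
        record[row[name_field]] = row[value_field]
    return list(wide.values())


# Problem 2: one column holding two variables.
sex_and_age = [split_sex_age(code) for code in ["m014", "m1524"]]

# Problem 3: variable names stored as values in a "measurement" column.
pets_long = [
    {"person": "p1", "measurement": "n_dogs", "value": 2},
    {"person": "p1", "measurement": "n_cats", "value": 0},
    {"person": "p2", "measurement": "n_dogs", "value": 1},
    {"person": "p2", "measurement": "n_cats", "value": 3},
]
pets_wide = spread(pets_long, "person", "measurement", "value")
```

After these steps, sex and age are separate values, and each person has one row with an `n_dogs` and an `n_cats` column.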

Here’s an example of a table that represents some messy data (from Dr. Wickham’s paper), which shows the number of people from different religions in different income brackets. We see that the columns hold the income ranges, and the rows represent each religion. This table shows problem #1 -- the column headers are values (the income ranges), not variable names.

Let's look at the tidy version (also from Dr. Wickham’s paper). The column headers are no longer values; they are variable names describing what is in each cell -- which is tidy! Values in the religion column may repeat, but because each one is paired with a different income level, no combination of religion and income is duplicated. And finally, the frequency is now its own column. This will make plotting a lot easier!
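
The reshaping described above, from one column per income range to one row per religion and income pair, can be sketched in a few lines of Python. The function name `melt` is an assumption for this example, and the religion labels and counts are placeholders, not the figures from the paper.

```python
def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own row: one (id, variable, value) each."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows


# Placeholder data: income ranges appear as column headers (messy).
wide_table = [
    {"religion": "Religion A", "<$10k": 5, "$10-20k": 8},
    {"religion": "Religion B", "<$10k": 2, "$10-20k": 4},
]

tidy = melt(wide_table, "religion", "income", "frequency")
for row in tidy:
    print(row["religion"], row["income"], row["frequency"])
```

Each row of `tidy` now holds exactly one observation, with `religion`, `income`, and `frequency` as the three variables.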

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
