Whatever type of data you work with, you will very likely have to transform it into a different format before you can do your analysis or visualization. Sometimes we can’t control the form in which we receive data, whether it comes from a machine, a public repository, or a carefully hand-crafted survey.
The first thing you should know when starting to process data is that raw data should be read-only. This means that you should never edit the original file directly; instead, make your changes in a new derivative file that reflects them.
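One way to put the read-only rule into practice is sketched below in Python, using only the standard library. The function name `make_working_copy` and the folder layout are assumptions made for illustration, not a prescribed workflow: the sketch copies the raw file into a separate folder and then removes write permission from the original.

```python
import shutil
import stat
from pathlib import Path


def make_working_copy(raw_path, derived_dir):
    """Copy a raw file into derived_dir, then mark the original read-only.

    Hypothetical helper: the name and folder layout are illustrative only.
    """
    raw_path = Path(raw_path)
    derived_dir = Path(derived_dir)
    derived_dir.mkdir(parents=True, exist_ok=True)

    # All edits should happen on this copy, never on the raw file.
    working_copy = derived_dir / raw_path.name
    shutil.copy2(raw_path, working_copy)

    # Remove write permission from the raw file for everyone.
    raw_path.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

    # copy2 also copies permission bits, so make sure the copy stays writable.
    working_copy.chmod(working_copy.stat().st_mode | stat.S_IWUSR)
    return working_copy
```

Calling something like `make_working_copy("raw/survey.csv", "derived")` would leave the raw file untouched and give you a writable copy to process. File permissions are only a guard against accidents; they do not replace backups or version control.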
The point of data processing is to make sure that our data becomes:
Some common ways to process data include:
There aren’t hard and fast rules for qualitative data preparation. It’s very much dependent on the field and on the method or framework. Some qualitative researchers leave out “um”, “ah”, and “well…………” when they transcribe interviews. Others leave them in as a valuable part of the interview, useful for looking at the sentiment and nature of the participant. When OCR-ing images or PDFs into machine-readable text, it’s about checking the accuracy of the scan: are the long ‘s’ characters (ſ, which look like an ‘f’) transformed into the modern ‘s’? Do you want them to be? These questions around data transformations are embedded in the framework under which qualitative researchers operate. So carefully consider your methodology while processing your qualitative data!
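As a small illustration of treating such a transformation as an explicit, documented choice, here is a minimal Python sketch. The function name `normalise_long_s` and its `modernise` flag are made up for this example; whether you modernise at all depends on your methodology.

```python
LONG_S = "\u017f"  # the long s character, 'ſ'


def normalise_long_s(text, modernise=True):
    """Return text with long s replaced by a modern 's' when modernise is True.

    Hypothetical helper: when modernise is False the text is returned unchanged,
    which preserves the historical spelling.
    """
    return text.replace(LONG_S, "s") if modernise else text
```

Making the choice a parameter means the decision is recorded in your processing script rather than buried in a manual edit.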
If you work with numerical data (especially in tabular formats like CSV or XLSX), you might also hear the phrase ‘tidy’ data. Hadley Wickham, who coined the phrase, defines the five most common problems that necessitate tidying in his paper Tidy Data:
Column headers are values, not variable names: for example, a table with columns ‘blue’, ‘brown’, and ‘green’ meant to represent eye color, where each row contains either 0 (no) or 1 (yes). This should really be one column, “eye color”, with the color as the value.
Multiple variables are stored in one column: for example, a table with sex and age range combined in one column (e.g. m014, m1524). This should be two columns: sex and age.
Variables are stored in both rows and columns: for example, a table with a column “measurement” that contains variable names like n_dogs and n_cats, and a column ‘value’ next to it holding the corresponding count for each person. This should really be at least two columns, n_dogs and n_cats, with numeric values showing the number of dogs and cats per person.
Multiple types of observational units are stored in the same table: for example, a table that contains someone’s age and height alongside their pet’s name, type of pet, and pet’s age. While it’s cute, these should be two tables: one for people and one for pets.
A single observational unit is stored in multiple tables: for example, a separate table of an individual’s medical history for each year of their life. This should really be one table, with each row representing a year of someone’s life.
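The first three problems above can be sketched in Python with pandas. This is a minimal illustration, not Wickham’s own code (his paper uses R): all table contents and names such as `eyes_wide`, `cases`, and `pets_long` are made up for the example, and the age labels assume that codes like 014 mean ages 0 to 14.

```python
import pandas as pd

# Problem 1: column headers are values (made-up data).
eyes_wide = pd.DataFrame({
    "person": ["a", "b", "c"],
    "blue": [1, 0, 0],
    "brown": [0, 1, 0],
    "green": [0, 0, 1],
})
# Turn the colour columns into rows, then keep only the rows flagged 1 (yes).
eyes_long = eyes_wide.melt(id_vars="person", var_name="eye_color", value_name="flag")
eyes_tidy = (
    eyes_long[eyes_long["flag"] == 1]
    .drop(columns="flag")
    .sort_values("person")
    .reset_index(drop=True)
)

# Problem 2: multiple variables in one column (made-up data).
cases = pd.DataFrame({"group": ["m014", "m1524", "f014"], "cases": [2, 4, 3]})
cases["sex"] = cases["group"].str[0]  # first character is the sex
# Remaining characters are the age range (assumed meaning of the codes).
cases["age"] = cases["group"].str[1:].map({"014": "0-14", "1524": "15-24"})
cases_tidy = cases[["sex", "age", "cases"]]

# Problem 3: variables stored in both rows and columns (made-up data).
pets_long = pd.DataFrame({
    "person": ["a", "a", "b", "b"],
    "measurement": ["n_dogs", "n_cats", "n_dogs", "n_cats"],
    "value": [1, 0, 2, 3],
})
# Spread the variable names in 'measurement' out into their own columns.
pets_tidy = pets_long.pivot(
    index="person", columns="measurement", values="value"
).reset_index()
pets_tidy.columns.name = None
```

Problems four and five are about how tables are split or combined rather than how a single table is shaped, so they usually call for separating or concatenating tables instead of reshaping one.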
Here’s an example of a table that represents some messy data (from Dr. Wickham’s paper), which shows the number of people from different religions in different income brackets. The column headers are the income ranges, and each row represents a religion. This table exhibits problem #1: the column headers are values, not variable names.
Let's look at the tidy version (also from Dr. Wickham’s paper). The column headers are no longer values; they are variable names describing what is in each cell, which is tidy! Religions repeat across rows, but each combination of religion and income level appears only once, so there are no duplicate rows. And finally, the frequency is now its own column. This will make plotting a lot easier!
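The same reshaping can be sketched in Python with pandas. The counts below are invented placeholders, not the figures from the paper, and the column names `income` and `freq` are chosen for this example.

```python
import pandas as pd

# Messy layout: one column per income range (counts are made up).
messy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist", "Buddhist"],
    "<$10k": [10, 20, 30],
    "$10-20k": [11, 21, 31],
    "$20-30k": [12, 22, 32],
})

# Tidy layout: one row per (religion, income) pair, with the count in 'freq'.
tidy = messy.melt(id_vars="religion", var_name="income", value_name="freq")

# Each religion repeats, but every (religion, income) pair appears only once.
has_duplicates = tidy.duplicated(subset=["religion", "income"]).any()
```

With the data in this shape, summaries such as `tidy.groupby("income")["freq"].sum()` work without any further restructuring, and most plotting libraries can map `income` and `freq` directly to axes.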