What is Data Cleaning?

Nealia Khan

Data cleaning is the process of detecting and correcting inaccurate, incomplete, or inconsistent records in a data file. Much of the early Botswana Combination Prevention Project (BCPP) data came from a baseline household questionnaire. Research Assistants (RAs) interviewed thousands of study participants and recorded their answers on laptops. The use of customized software minimized data entry errors.

As data was collected, Nealia Khan, the BCPP Data Manager, cleaned it using the following steps:

Confirm that data exists and conforms to expectations.

Does the file contain information in the right format with the approximate number of records expected? For example, does the batch of data contain information from several hundred study participants from a pair of villages in Botswana?
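A first-pass check like this can be sketched in a few lines of Python. The column names, the count range standing in for "several hundred," and the function name `check_batch` are all assumptions made for illustration; the article does not describe the actual BCPP file layout or tooling.

```python
import csv
import io

# Hypothetical column names and thresholds; the real BCPP layout is not given.
EXPECTED_COLUMNS = {"participant_id", "village", "interview_date"}
EXPECTED_MIN, EXPECTED_MAX = 200, 900  # rough stand-in for "several hundred"
EXPECTED_VILLAGES = 2                  # a batch covers a pair of villages


def check_batch(csv_text):
    """Return a list of problems found in one batch of questionnaire data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []

    # Format check: are the columns we expect actually present?
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # no point counting rows in a malformed file

    # Volume check: is the record count in the expected range?
    rows = list(reader)
    if not EXPECTED_MIN <= len(rows) <= EXPECTED_MAX:
        problems.append(f"unexpected record count: {len(rows)}")

    # Coverage check: does the batch span the expected number of villages?
    villages = {row["village"] for row in rows}
    if len(villages) != EXPECTED_VILLAGES:
        problems.append(
            f"expected {EXPECTED_VILLAGES} villages, found {len(villages)}"
        )
    return problems
```

An empty list means the batch looks plausible enough to move on to the next step; anything else is a prompt to go back to the source before cleaning further.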

Reconcile discrepancies in answers from study participants.

For various reasons, study participants don’t always give accurate answers. They may misunderstand a question, misremember past events, or feel obliged to offer socially acceptable answers.

It’s not the RA’s job to question a participant’s memory or integrity. Instead, well-designed questionnaires include redundancy to help verify responses. The same question may be asked several times in different ways. For example, if a participant says she has never been tested for HIV, yet has her own prescription for HIV drugs, then a “logic-check” would flag her answers.
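The HIV testing example can be written as a simple rule. This is only a sketch of what such a logic-check might look like: the field names `ever_tested_for_hiv` and `has_hiv_prescription` and the function `logic_check` are hypothetical, not taken from the BCPP questionnaire.

```python
def logic_check(answers):
    """Flag contradictory answers in one participant's questionnaire.

    `answers` maps hypothetical field names to "yes" or "no".
    """
    flags = []
    never_tested = answers.get("ever_tested_for_hiv") == "no"
    has_prescription = answers.get("has_hiv_prescription") == "yes"

    # Someone with their own HIV prescription has almost certainly been tested.
    if never_tested and has_prescription:
        flags.append("never tested for HIV but holds an HIV drug prescription")
    return flags
```

Note that the check only flags the record. Deciding which answer is correct still takes a person looking at the other available information.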

Derive variables from raw data.

A derived variable is one that is not asked directly but is computed from questionnaire answers combined with other available information, such as clinic records.

Take ART status—whether a participant with HIV is on antiretroviral treatment (ART). If a participant says he’s not on ART, but records at the local clinic show he once had a prescription for ART and a suppressed viral load, it’s possible he stopped taking his medication. The team investigates further to determine if he’s a treatment defaulter.
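As a rough illustration, the reasoning above might be encoded as follows. The function name, its three inputs, and the status labels are assumptions for this sketch, and the rules are a simplification that mirrors only the case described here, not the actual BCPP derivation.

```python
def derive_art_status(self_reported_on_art,
                      clinic_had_art_prescription,
                      viral_load_suppressed):
    """Combine a questionnaire answer with clinic records into one status.

    All three inputs are booleans. The labels returned are hypothetical.
    """
    # Assumption: a participant who reports being on ART is taken at their word.
    if self_reported_on_art:
        return "on_art"

    # The case from the text: says "not on ART", but the clinic shows a past
    # prescription and a suppressed viral load. He may have stopped treatment,
    # so the record is routed to the team rather than decided automatically.
    if clinic_had_art_prescription and viral_load_suppressed:
        return "possible_defaulter_needs_review"

    # Simplification: every other combination is treated as not on ART.
    return "not_on_art"
```

For the participant in the example, `derive_art_status(False, True, True)` returns `"possible_defaulter_needs_review"`, which reflects that the final call is made by the team after further investigation.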

After the data has been cleaned, investigators begin their analysis.

Title photo of Nealia Khan by Lucia Ricci.