[Courtesy of Pixabay](https://pixabay.com/de/photos/datenschutz-datenschutzerkl%C3%A4rung-5243225/) Courtesy of Pixabay

Data protection

In psychology, data protection plays a particularly important role, since data is usually collected from humans (in other cases it is data from simulation studies or animal studies). In this article you will learn how to ensure data protection for your projects on LIFOS and what you have to consider for this.

  1. Data protection – what, why, and how
  2. What are personal, pseudonymous and anonymous data?
  3. Methods of anonymization
  4. Checklist
  5. What to do in case of uncertainty?

1. Data protection – what, why, and how

The General Data Protection Regulation (GDPR or in German: Datenschutz-Grundverordnung (DSGVO)) is a regulation of the European Union and contains provisions and rules on how personal data may be processed. Personal data are “(…) any information relating to an identified or identifiable individual (…)” (Art. 4 para. 1 DSGVO), whereby an individual is considered identifiable if they “can be identified directly or indirectly (…)” (ibid.). A person can be identified “in particular by means of: association with an identifier such as a name, an identification number, location data, an online identifier, or by association with one or more factors specific to (…) the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person” (ibid.).

The aim of the GDPR is to preserve the informational self-determination of a person. Particularly in research with human subjects, it is ethically imperative to ensure that participants give their consent and participate voluntarily and in an informed manner. This includes providing information about how the data will be collected (e.g., pseudonymized or anonymized) and what will happen to the data after the study ends (e.g., will the data be deleted after 10 years or will the data be shared publicly?). Important: If the subjects were not informed in the informed consent form that their data will be shared publicly (anonymized) with other scientists after the study, then it is not allowed!


2. What are personal, pseudonymous and anonymous data?

As we have just learned, in psychological research all data (especially in combination) are considered personal data and are therefore identifiable in case of doubt. In this context, hat are personal, pseudonymous and anonymous data?

Personal data: This is data that can be clearly assigned to a person and that occurs in particular in combination, sometimes uniquely, such as first name and surname, date of birth, place of birth, place of residence, telephone number, etc. This also includes health-related data. In this context, the Health Insurance Portability and Accountability Act in the USA has listed 18 so-called identifiers that count as personal data (including the above-mentioned and, in addition, geographical data (e.g. zip code), important data (e.g. date of school graduation), email addresses, social security number, tax identification number, IP addresses, medical records, etc.).

Pseudonymous data: In psychology, we often collect pseudonymized data by, for example, asking participants to generate an individual code (also known as a participant code). This code is stored separately from the participants’ names along with the actual data, referred to as a decoding list or blinding list. Pseudonymization ensures that the data can be attributed back to individual participants and re-identified using the decoding list, for purposes such as deleting a participant’s data upon request. Note: The decoding list, as well as the participant-generated codes, should never be shared publicly!

Question: Why is not allowed to share the participant code publicly? Imagine: My partner and I are both taking part in a study. Because data is collected over multiple time points, each participant must generate an individual code. Even if the instructions for generating the participant code are not shared, I know how my code was generated. Moreover, I know my partner well enough that I would be equally capable of generating his participant code. Accordingly, I can also re-identify him in the list. Ergo, anonymity is not guaranteed! If the participant code is not shared, it is harder for me to re-identify my partner or ideally not possible at all.

 

Anonymous data: The requirement for anonymous data is that it is impossible to assign the data to the subjects. Attention: As soon as even a single person can be identified in the data set, the data are no longer anonymous!

Question: How can a data set collected anonymously not be anonymous? Imagine: In your study cohort, a survey will be conducted on the Big 5 personality traits. In addition, your gender and age will also be recorded. The researchers conducting the survey will not receive any information about who completes the survey other than the information just listed. Now, in your cohort, however, there are not only females and males between the ages of 18 and 25, but there may also be a non-binary person, a woman aged 41 and a man aged 50. Once you know these students, you will also recognize them in the data set and the data will no longer be considered anonymous. It would be different if the survey had been run in the whole of Germany, for example. Age and gender are then no longer necessarily sufficient to identify the individuals. What is clear from this example is that potentially any characteristic or combination of characteristics could make a person identifiable.

 

Increased risk: The following circumstances may lead to a higher risk of re-identification:

  • Small data sets with only a very small number of subjects (e.g., all persons in Germany who survived a plane crash)
  • Data sets with a very specific sample (e.g., first semester psychology students who are already parents)
  • Variables with unique trait characteristics (e.g., individual biographical details, or one person is older than the others by a wide margin)
  • Inherently rare values (e.g., rare diseases)
  • Dyadic data (e.g., twin studies, partnership studies)

3. Methods of anonymization

Most of the data we collect in psychology are not immediately anonymous. It often requires changes to the data set to achieve anonymization. Several approaches are explained below.

  1. Removal of uniquely personal variables (such as names, email addresses, matriculation numbers, and also participant codes)
  2. Removal of variables that are not needed for reproducibility of results (according to the principle of data economy. Removed can be e.g.: Mmta data such as login time, server log data, browser and OS version, and also free text fields)
  3. Pay attention to unique feature characteristics (e.g. gender, age, …) and remove variables if necessary
  4. K-anonymize data (K-anonymity, so that no combination of characteristic values is unique in the data set, but occurs more often and thus does not allow conclusions about individual persons. K-anonymity can be achieved by various means, such as binning: instead of continuous ages 21, 23, and 24, one specifies ordinal categories: between 20 and 25 years, making three individual values into one common one)

Anonymization techniques always also lead to a loss of information, but are of immense importance in order to preserve anonymity and the protection of participants!


4. Checklist

Here is a checklist of questions to help you verify that your dataset has been anonymized enough to share on LIFOS.

  • Have all personal data been removed from the data set?
  • Have all irrelevant variables (e.g., not relevant for evaluation or automatically collected) been removed from the data set?
  • Are there unique characteristic values on variables or combinations of several variables that could lead to re-identification of a participant?
  • Does the data set have structures that lead to an increased risk of identification (see 2.) and would it be better to share the data set only upon request (“Data available upon request”) to protect the participant?
  • Could a participant be harmed by being recognized in the data set (e.g., by blackmail involving disclosure of sensitive data)?

Note: For LIFOS, we recommend removing age and gender entirely from data sets if they are not the topic of the study.


5. What to do in case of uncertainty?

If you are unsure whether your data set can be shared on LIFOS, please contact your supervisor and please ask your supervisor to approve your data set before uploading it. In case of further questions, uncertainties or comments you can also contact us at LIFOS[at]uni-frankfurt[punkt]de.