Leakage of data in research can result in serious losses for research subjects, sponsors, and investigators. For example, leakage of subject identity data can expose research subjects to the risk of identity theft, embarrassment, and even physical and mental harm. Such an incident can also damage the investigators' credibility and make it difficult to conduct other research in the future. In the event of data leakage, investigators must report the incident to the Research Ethics Committee and notify all individuals whose data was leaked.
Some of the causes of data leaks are as follows.
- Loss of research tools/devices that store confidential data.
- Exchange of data via e-mail.
- Storage on cloud services such as Box, Dropbox, or Google Drive without additional encryption.
- Incomplete deletion of data from the device.
For this reason, every tool or device used to store data and every communication line used to transmit it must be protected. The following are some measures investigators can take to protect research data.
Using Data De-identification
Research data often contain Personally Identifiable Information (PII), which is information or a combination of information that can be used to identify a particular individual (e.g. name, ID card number). It is dangerous if research data can be directly linked to the research subject's PII. For this reason, investigators must separate PII from all research data used for analysis. One way to do this is to assign each subject a randomly generated study ID, so that the analysis data carries the ID instead of the individual's personal identity. The study ID can be created using software such as Stata, R, or Microsoft Excel, provided that each ID is unique. To keep the data from being re-identified, avoid the following when creating a study ID.
- Deriving the ID from characteristics of the research data (e.g. randomized KTP numbers).
- Assigning IDs in alphabetical order of the research subjects' names.
After the de-identification process, the two datasets (the dataset containing PII and the dataset for analysis) must not be combined except where this is required. The dataset containing PII must be kept in encrypted storage and protected from viruses.
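The separation described above can be sketched in Python as follows. This is a minimal illustration and not a prescribed procedure: the function name `assign_study_ids`, the field names, and the 8-character hexadecimal ID format are all assumptions made for the example.

```python
import secrets


def assign_study_ids(records, pii_fields, id_bytes=4):
    """Split records into a PII table and an analysis table linked by random study IDs.

    `records` is a list of dicts; `pii_fields` names the keys that hold PII.
    Both names are hypothetical and only serve this illustration.
    """
    pii_rows, analysis_rows = [], []
    used_ids = set()
    for record in records:
        # Draw an ID that is unrelated to any attribute of the subject.
        # Redraw on the rare collision so that every ID stays unique.
        study_id = secrets.token_hex(id_bytes)
        while study_id in used_ids:
            study_id = secrets.token_hex(id_bytes)
        used_ids.add(study_id)

        pii = {k: v for k, v in record.items() if k in pii_fields}
        analysis = {k: v for k, v in record.items() if k not in pii_fields}
        pii_rows.append({"study_id": study_id, **pii})
        analysis_rows.append({"study_id": study_id, **analysis})

    # Sort by the random ID so that row order no longer mirrors the
    # original order (for example, alphabetical order of names).
    pii_rows.sort(key=lambda row: row["study_id"])
    analysis_rows.sort(key=lambda row: row["study_id"])
    return pii_rows, analysis_rows


# Placeholder records, invented purely for demonstration.
subjects = [
    {"name": "Subject A", "id_card": "0001", "income": 5.2},
    {"name": "Subject B", "id_card": "0002", "income": 3.9},
]
pii_table, analysis_table = assign_study_ids(subjects, pii_fields={"name", "id_card"})
```

The two returned tables would then be saved separately, with the PII table going to encrypted storage.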
In addition to electronic data, physical data such as survey files containing the identity of the subject must also be protected. Investigators should consider separating PII from other analytical data when designing surveys. One option is to put personal information on the survey cover sheet. After the survey is completed and de-identification has been carried out, the cover sheet is stored separately from the other analysis data.
Data Storage Encryption
Encryption is the conversion of data into a coded form that can only be opened with a password or key. Some computer operating systems already have their own encryption software. However, investigators may also consider third-party encryption software, which typically relies on algorithms such as AES or Blowfish. Investigators are advised to encrypt data at several levels as follows.
- Encryption of the whole device. Encrypting the entire device protects all information on it, including the operating system. A password is required every time the device is unlocked. This also protects temporary files, which are duplicates of the original files that users are often unaware of.
- Encryption of the cloud services used. Some cloud services such as Dropbox, Box, and Google Drive provide encryption for stored files. To use it, users must enable the encryption feature. Without activating this feature, the data stored in the cloud service is not encrypted.
- Folder encryption. Encrypting a folder or file before uploading it to a cloud service can reduce the risk of data leaks during upload. At the folder level, encryption protects specific folders by requiring a password to open them. One of the most widely used open-source tools for this is VeraCrypt.
- File encryption. Encryption at the file level protects specific files even after they leave the folder or device on which they are stored. It requires users to enter a password when sending, receiving, and opening files.
KEP LPEM FEB UI specifically recommends that investigators encrypt at least the folders that store research data, especially research data containing personal identity information of research subjects.
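To illustrate how a password relates to an encryption key, the sketch below derives a key from a password using PBKDF2 from Python's standard library. It is an illustration only and does not encrypt anything by itself; actual encryption should be left to established tools such as VeraCrypt. The function name `derive_key`, the 16-byte salt, and the iteration count are assumptions chosen for the example.

```python
import hashlib
import secrets


def derive_key(password, salt=None, iterations=600_000):
    """Derive a 32-byte key from a password with PBKDF2-HMAC-SHA256.

    The salt is random and not secret; it must be stored next to the
    encrypted data so the same key can be derived again later.
    The iteration count is an assumed value, not a recommendation
    taken from this guide.
    """
    if salt is None:
        salt = secrets.token_bytes(16)
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )
    return key, salt


# The same password and salt always give the same key;
# a different password gives a different key.
key, salt = derive_key("example passphrase")
same_key, _ = derive_key("example passphrase", salt=salt)
other_key, _ = derive_key("another passphrase", salt=salt)
```

This also shows why a forgotten password usually cannot be recovered: without the password, the key cannot be derived again.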
Data storage recommendations can be briefly summarized in the following table.
| Data Type | Storage Recommendations |
| --- | --- |
| Raw data or data with PII | Encrypted folder on cloud storage/data server specifically for research projects |
| Data that has gone through the de-identification process | Normal folders, but device still needs to be password protected |
| Physical data (paper questionnaire) | Safety box/cabinet |
Protecting Data Transfer
Data that is protected by encryption during storage is not necessarily protected while it is being distributed. To reduce the threat of data leaks during transmission, here are some things you can do:
- Avoid sending data by e-mail. If e-mail is unavoidable, make sure the files are encrypted and, if possible, use the e-mail service's encryption feature when sending.
- Ensure that every file stored in the cloud is encrypted beforehand. Uploading an unencrypted file (even if it is deleted right after) can be dangerous because it leaves a trail.
- Avoid repeated distribution. If more than one person has to access the data, storing it in a secure cloud service is less risky than sending files to different people over and over again.
In addition to electronic data, physical materials such as survey files must also be transferred with care, for example in a locked suitcase and by private vehicle.
When data is shared among more than one person, it is very helpful for the research team to draw up a protocol for sending and using data among team members. This ensures that each member understands the steps for securing research data and avoiding leakage.
Using Passwords
Even if a file is protected with a password, the password may still be cracked. Here are some steps you can take to reduce this risk:
- Use a different password for each device or service (cloud, laptop, mobile, etc.).
- Make the password complex; the more complex it is, the harder it is to crack.
- Use a combination of numbers, symbols, and uppercase and lowercase letters.
- Avoid using personal identification (name, date of birth, place of residence, institution, etc.) as a password.
- Avoid using repeated words.
However, making sure passwords are easy for investigators to remember is just as important as making sure they are hard to crack. Some encryption software does not provide a forgotten-password feature. If the password is forgotten, the research data can become permanently inaccessible. To anticipate this problem, investigators are advised to use a password manager application such as LastPass. Password managers can help investigators create complex random passwords and store them in the investigator's encrypted account.
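For investigators who want to see what a complex random password looks like in practice, the following is a minimal sketch using Python's standard `secrets` module. The function name `generate_password`, the default length of 16, and the symbol set are assumptions for the example; a password manager does this job for you.

```python
import secrets
import string

# Assumed symbol set; some services accept only a subset of these.
SYMBOLS = "!@#$%^&*()-_=+"


def generate_password(length=16):
    """Return a random password with lowercase, uppercase, digit, and symbol characters."""
    classes = [string.ascii_lowercase, string.ascii_uppercase, string.digits, SYMBOLS]
    if length < len(classes):
        raise ValueError("length must be at least 4 to include every character class")

    # Guarantee one character from each class, then fill the rest at random.
    chars = [secrets.choice(group) for group in classes]
    alphabet = "".join(classes)
    chars += [secrets.choice(alphabet) for _ in range(length - len(classes))]

    # Shuffle so the guaranteed characters are not always in the first positions.
    secrets.SystemRandom().shuffle(chars)
    return "".join(chars)


password = generate_password(20)
```

A password generated this way contains no personal information and no repeated words, but it is hard to memorize, which is why storing it in a password manager is advisable.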
Avoid Data Loss
To protect data from the risk of loss, investigators should keep a backup of the data in a separate place. Backups can be stored either in the cloud or on the research institution's servers. File backup software such as SyncBack can also be used.
In addition, investigators should periodically run an antivirus program to avoid data corruption.
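A backup is only useful if the copy matches the original. The sketch below copies a file to a separate location and compares SHA-256 checksums of the two files. It is a minimal illustration; the function names `sha256_of` and `back_up_file` are assumptions for the example, and dedicated backup software offers far more (scheduling, versioning, and so on).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up_file(source, backup_dir):
    """Copy `source` into `backup_dir` and verify that the copy is identical."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps

    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

If the file contains PII, it should already be encrypted before it is copied, so that the backup is protected as well.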
Data Deletion
To ensure that all data has been deleted, including from the recycle bin, investigators are advised to use third-party software that provides data erasure (e.g. Eraser, WipeDrive). Physical data must also be destroyed when it is no longer needed. A cross-cut shredder is preferable to a strip-cut shredder because the shredded material is harder to reconstruct. The research team should also consider the appropriate time to remove the research subjects' PII.
Reporting the Data Security Protocol to KEP
When applying for ethics approval, the research team is required to describe its data security protocol. The description should contain at least the following:
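To show why ordinary deletion is incomplete, the sketch below overwrites a file's contents with random bytes before removing it. This is an illustration of the idea only, under the assumption of a traditional hard disk: on SSDs, on journaling or copy-on-write file systems, and in cloud-synced folders, older copies of the data may survive, so it does not replace dedicated erasure software. The function name `overwrite_and_delete` is hypothetical.

```python
import os
import secrets


def overwrite_and_delete(path, passes=1, chunk_size=1024 * 1024):
    """Overwrite a file with random bytes, then delete it.

    Assumption: the storage device writes the new bytes over the old ones.
    This is not guaranteed on SSDs or copy-on-write file systems.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        for _ in range(passes):
            handle.seek(0)
            remaining = size
            while remaining > 0:
                count = min(chunk_size, remaining)
                handle.write(secrets.token_bytes(count))
                remaining -= count
            # Push the overwritten bytes from memory to the storage device.
            handle.flush()
            os.fsync(handle.fileno())
    os.remove(path)
```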
- Data properties
  - Is the data electronic (recording, video, etc.) or physical (survey paper, samples, etc.)?
  - Does the data contain PII?
  - Is PII separated from analysis data? When was PII separated? Who has access?
- Data storage
  - Where is the data stored?
  - How is stored data secured?
  - What software is used to store data?
  - Who has access to the stored data?
- Data transfer
  - How is data distributed? Electronically or physically?
  - To whom will the data be distributed?
  - How is data secured when distributed?
  - Is there a non-disclosure agreement (NDA) for data distributed to outside parties?
- Data deletion
  - When is data deleted?
  - How is data deleted?
  - What software is used?
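A research team may find it convenient to keep the answers to these questions in a structured form and check that none is left blank before submitting. The sketch below is one possible way to do this; the class `DataSecurityProtocol`, its field names, and the helper `unanswered_items` are assumptions for the example and are not a format required by KEP.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DataSecurityProtocol:
    """One field per question in the checklist; None means not yet answered."""
    # Data properties
    data_format: Optional[str] = None          # electronic or physical
    contains_pii: Optional[bool] = None
    pii_separation: Optional[str] = None       # whether, when, and who has access
    # Data storage
    storage_location: Optional[str] = None
    storage_security: Optional[str] = None
    storage_software: Optional[str] = None
    storage_access: Optional[str] = None
    # Data transfer
    transfer_method: Optional[str] = None      # electronic or physical
    transfer_recipients: Optional[str] = None
    transfer_security: Optional[str] = None
    nda_for_outside_parties: Optional[bool] = None
    # Data deletion
    deletion_time: Optional[str] = None
    deletion_method: Optional[str] = None
    deletion_software: Optional[str] = None


def unanswered_items(protocol):
    """Return the names of checklist items that have not been filled in."""
    return [f.name for f in fields(protocol) if getattr(protocol, f.name) is None]


draft = DataSecurityProtocol(data_format="electronic", contains_pii=True)
missing = unanswered_items(draft)  # lists the fields still set to None
```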