Project Overview
Trends toward open science practices, along with advances in technology, have promoted increased data archiving in recent years, thus bringing new attention to the reuse of archived qualitative data. Qualitative data reuse can increase efficiency and reduce the burden on research subjects, since new studies can be conducted without collecting new data. Qualitative data reuse also supports larger-scale, longitudinal research by combining datasets to analyze more participants. At the same time, qualitative research data can increasingly be collected from online sources. Social scientists can access and analyze personal narratives and social interactions through social media such as blogs, vlogs, online forums, and posts and interactions from social networking sites like Facebook and Twitter. These big social data have been celebrated as an unprecedented resource for data analytics, capable of producing insights about human behavior on a massive scale. However, both types of research also present key epistemological, ethical, and legal issues. This study explores the issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership, with a focus on data curation strategies. The research suggests that connecting qualitative researchers, big social researchers, and curators can enhance responsible practices for qualitative data reuse and big social research.
This study addressed the following research questions:
RQ1: How is big social data curation similar to and different from qualitative data curation?
RQ1a: How are epistemological, ethical, and legal issues different or similar for qualitative data reuse and big social research?
RQ1b: How can data curation practices such as metadata and archiving help address and resolve some of these epistemological and ethical issues?
RQ2: What are the implications of these similarities and differences for big social data curation and qualitative data curation, and what can we learn from combining these two conversations?
Data Description and Collection Overview
The data in this study were collected using semi-structured interviews centered on specific incidents of qualitative data archiving or reuse, big social research, or data curation. Participants were therefore drawn from three categories: researchers who had used big social data, qualitative researchers who had published or reused qualitative data, and data curators who had worked with one or both types of data. Six key issues were identified in a literature review and were then used to structure three interview guides for the semi-structured interviews. The six issues are context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership.
Participants were limited to those working in the United States. Ten participants were interviewed from each of the three target populations: big social researchers, qualitative researchers who had published or reused data, and data curators. The interviews were conducted between March 11 and October 6, 2021. During scheduling, participants received an email asking them to identify a critical incident prior to the interview. The “incident” in the critical incident interviewing technique is a specific example that focuses a participant’s answers to the interview questions.
Participants were asked for permission to record the interviews, which were recorded using the built-in recording feature of the Zoom videoconferencing software. The author also took notes during the interviews. Otter.ai speech-to-text software was used to create initial transcriptions of the interview recordings, and a hired undergraduate student hand-edited the transcripts for accuracy. The transcripts were then manually de-identified.
The author analyzed the interview transcripts using a qualitative content analysis approach, which combined deductive and inductive coding. After reviewing the research questions, the author used NVivo software to identify chunks of text in the interview transcripts that represented key themes of the research. Because the interviews were structured around the six key issues identified in the literature review, the author deductively created a parent code for each issue: context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. The author then used inductive coding to create sub-codes beneath each of these parent codes.
Selection and Organization of Shared Data
The data files consist of 28 interview transcripts: transcripts from Big Social Researchers (BSR), Data Curators (DC), and Qualitative Researchers (QR). Two participants declined to have their data shared. Additional data files include the redacted interview analysis and participant summaries (Memos). The documentation files include the individual interview guides for each participant category, two versions of the consent form, a codebook, the emails sent to participants, a file of interview dates and durations, and the IRB protocol.