A recent claim about natural language processing (NLP) techniques caught my attention. A ChatGPT-generated article in Almanac suggested that “ChatGPT can make sense of vast amounts of qualitative data received from interviews and open-ended evaluation questions” and can “analyze and categorize data, providing insights that can inform decision-making and program improvement.”
Summarizing qualitative responses to open-ended evaluation survey questions has long posed a challenge to the continuing medical education/continuing professional development (CME/CPD) community. However, ChatGPT’s promise to solve this challenge is undermined by its effects on content integrity. This article describes the current disadvantages of using ChatGPT for qualitative outcomes evaluation and shares ideas for enhancing the integrity of ChatGPT-generated content.
What Is Qualitative Analysis?
Qualitative outcomes data typically include responses to open-ended survey questions collected after learners complete an education activity, or other text-based data, such as interview or focus group transcripts. These “unstructured” data are often stored in a spreadsheet and are typically screened, cleaned and categorized into variables or attributes (e.g., participant designation, setting, degree) prior to coding and analysis.
Qualitative coding is essentially interpretive. Its purpose is to condense data into units which are then grouped by relevant characteristics according to a methodological framework that aligns with the research design and goals. For instance, one can code responses to open-ended survey questions in different ways, such as coding segments of text according to predetermined concepts (“a priori” coding), or coding for attributes, emotions, or actions and behaviors (e.g., by looking for gerunds in responses).
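To make a priori coding concrete, here is a minimal sketch in Python (pandas), using invented responses and an invented keyword-to-concept dictionary. Real coding is interpretive, so a lexical match like this is only a first, mechanical approximation:

```python
import pandas as pd

# Hypothetical export of open-ended survey responses (one row per learner).
responses = pd.DataFrame({
    "participant_id": [101, 102, 103],
    "designation": ["MD", "RN", "PA"],
    "response": [
        "I plan to change how I screen patients for hypertension.",
        "The case discussions helped me feel more confident counseling patients.",
        "More time for questions would improve the activity.",
    ],
})

# Invented a priori coding frame: predetermined concepts mapped to indicative keywords.
coding_frame = {
    "practice_change": ["change", "plan to", "implement"],
    "confidence": ["confident", "confidence"],
    "activity_feedback": ["improve", "more time", "format"],
}

def apply_a_priori_codes(text: str) -> list[str]:
    """Return every predetermined code whose keywords appear in the response."""
    text = text.lower()
    return [code for code, keywords in coding_frame.items()
            if any(kw in text for kw in keywords)]

responses["codes"] = responses["response"].map(apply_a_priori_codes)
print(responses[["participant_id", "codes"]])
```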
Content integrity is as central to qualitative analysis (QA) as it is to the work we do as CME/CPD practitioners. In QA, reliability and validity are the benchmarks for integrity. In any type of QA, we can ensure reliability by, for instance, triangulating data, and ensure analytic validity via an audit trail that documents what one did and why.
Qualitative Outcomes Evaluation via ChatGPT: An Exercise in Magical Thinking
We need time and cognitive effort (i.e., concentration, focus, intuition, and synthetic and integrative thinking) to sort and synthesize qualitative data. This time and effort load represents a significant barrier to analyzing qualitative outcomes data.
In contrast, ChatGPT can:
a) Process large volumes of data more quickly than humans.
b) Potentially maintain analytic consistency throughout data processing.
c) Detect patterns that humans might miss when working with large volumes of text-based data.
However, generative artificial intelligence (AI) has serious drawbacks:
ChatGPT has no capacity for context. While AI is speedy, it cannot consider and report the underlying nuances, perspectives, metaphors, emotions and concerns in text. It also misses the synonyms and inflections that humans are more likely to detect when coding qualitative data. And while AI might be consistent, that consistency is not grounded in contextual understanding. As a result, the insights and recommendations it generates are not based on the full picture of the data.
ChatGPT is not a substitute for analysis. The summaries that ChatGPT generates in response to prompts are not the same as analytic themes or categories, regardless of the level of detail included in the prompt. Summaries represent probabilistic patterns in text to which ChatGPT applies labels (in effect, a type of autocoding). These patterns are based on statistical regularities in ChatGPT’s training data. You can train ChatGPT on your data, but you won’t always get the same answer in response to the same question. In contrast, QA themes are derived from the systematic application of a categorical or indexical coding framework that is aligned with your research goal.
Content integrity. ChatGPT has transparency, reliability and validity problems that undermine content integrity. First, there’s no transparency about why ChatGPT labels and retrieves some text segments and not others, and no audit trail to show how and why it made those retrieval decisions. Second, ChatGPT generates inaccuracies when summarizing existing text, including fabricated responses that are convincing but that contain errors, irrelevancies and misrepresentations of data. At best, ChatGPT provides a high-level summary of data, but you lose the link to the original data on which the generated text builds. As a result, you also lose the ability to verify the validity of ChatGPT’s insights. In contrast, computer-assisted qualitative data analysis software (CAQDAS) programs are relational, which means they retain links to the raw data and allow for validity and reliability checking.
Bias and ethics. Unintended biases are embedded in the digital data that train NLP models like ChatGPT. As a result, these biases can be reproduced or amplified in the insights generated from your qualitative outcomes data. Questions also persist about data security, consent and privacy in using AI for data analysis. The text you feed into ChatGPT, especially with tools such as ChatPDF, is no longer private. You can certainly opt out of having your data collected for GenAI training, but neither OpenAI nor the third-party providers that use APIs to deliver services (i.e., plugins) are obliged to treat your input as confidential. There’s always the possibility that your data will be accessible to others, even if your data are de-identified and aggregated. Moreover, although you can train ChatGPT to understand your data, there’s no guarantee that it isn’t also drawing on data from external sources to feed its observations about your data.
Suggested Use Cases for ChatGPT
Despite these limitations, ChatGPT could augment, but not replace, the work of human researchers in producing rigorous qualitative outcomes evaluation. Here are some potential use cases:
Pair ChatGPT with other tools. When paired with other tools, ChatGPT-generated summaries could support the direction of analysis rather than substitute for analysis. For instance, if you are already using plugins to sort, structure, and summarize your data with ChatGPT, it might be simpler to use the concatenate function in Excel or Google Sheets, or the Automated Qualitative Assistant, which transforms open-ended survey responses into a computer-readable format. CAQDAS programs like MaxQDA, NVivo, or DeDoose are also designed for this purpose, and some, like ATLAS.ti, are beginning to integrate generative-AI technology. Using ChatGPT in conjunction with other tried and tested QA tools will enhance the reliability and validity of your observations and provide a robust grounding for planning and action.
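As an illustration of the concatenation step, the sketch below assumes survey responses exported to a CSV with hypothetical column names (participant_id, question, response); the equivalent in Excel or Google Sheets is a CONCATENATE or TEXTJOIN formula:

```python
import pandas as pd

# Hypothetical CSV export of post-activity survey data.
df = pd.read_csv("outcomes_survey.csv")  # columns assumed: participant_id, question, response

# Combine all responses to each open-ended question into a single block of text,
# e.g., to paste into a summarization tool while the row-level data stay intact.
combined = (
    df.dropna(subset=["response"])
      .groupby("question")["response"]
      .agg(" | ".join)
)
print(combined)
```

Because the spreadsheet keeps the row-level responses, the link back to each participant is preserved even after the combined text is handed to ChatGPT for a first-pass summary.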
Coding. ChatGPT-generated summaries can certainly be valuable to kickstart the analysis process, but they are best viewed as descriptive codes that offer a starting point for analysis. They are not a substitute for analysis. Consider coding text first using a CAQDAS tool or the concatenate function before asking ChatGPT to summarize that text. If you are intent on using ChatGPT to code your data first (i.e., summarize), consider ChatGPT’s output as first-round coding. Then, manually code a data subset to test the reliability and validity of ChatGPT’s summaries. This approach will help you build a coding framework that includes the descriptions of and explanations for codes. Such a framework is a core element of a QA audit trail.
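One way to run that reliability check is sketched below, assuming ChatGPT’s first-round codes and a human coder’s codes for the same subset of responses have been recorded side by side; Cohen’s kappa (here via scikit-learn) is one standard agreement statistic, though your coding framework may call for another:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical first-round codes: ChatGPT's label vs. a human coder's label
# for the same subset of responses (one code per response, for simplicity).
chatgpt_codes = ["practice_change", "confidence", "confidence",
                 "activity_feedback", "practice_change"]
human_codes   = ["practice_change", "confidence", "practice_change",
                 "activity_feedback", "practice_change"]

# Cohen's kappa measures agreement beyond chance; values near 1 indicate
# strong agreement, values near 0 indicate agreement no better than chance.
kappa = cohen_kappa_score(chatgpt_codes, human_codes)
print(f"Inter-coder agreement (Cohen's kappa): {kappa:.2f}")

# Document the result, the code definitions, and any disagreements
# as part of your audit trail.
```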
Prompts. Crafting credible prompting strategies is a bit like choosing Boolean operators when searching bibliographic databases. Experiment with prompts, test the differences in the insights and suggestions each prompt offers, and document your approach. Effective QA prompts should provide context and clear questions, and include examples of data already coded according to a coding framework that makes sense for your analytic goals. Prompts should also specify how you want output structured and formatted.
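For illustration, here is one hypothetical prompt template that follows that structure; the codes and example responses are invented:

```python
# Hypothetical prompt template illustrating context, a clear task,
# coded examples, and a requested output format. Codes are invented.
PROMPT = """You are assisting with qualitative analysis of post-activity
CME survey responses.

Coding framework (with examples already coded by a human analyst):
- practice_change: "I will start screening all patients for hypertension."
- confidence: "I feel more confident counseling patients about statins."
- activity_feedback: "The session ran long; allow more time for questions."

Task: Assign one or more of the codes above to each response below.
If no code fits, label the response "uncoded" rather than inventing a code.

Output format: a table with columns response_id, codes, supporting quote.

Responses:
{responses}
"""

# Fill the placeholder with your own data before sending the prompt.
print(PROMPT.format(responses="1. I plan to update our discharge checklist."))
```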
Consider context. ChatGPT underperforms in understanding nuances or subtext and in accounting for the conditions surrounding a statement or segment of text. Deliberate, effortful, analytic cognition, or what Daniel Kahneman calls “slow” thinking, can be constructive for parsing out how context determines meaning.
Disclosure. In the interests of data privacy, disclose to participants that you are using tools like ChatGPT to explore their comments and perspectives, so that they are fully informed of how and where their data will be used and shared. This point is especially pertinent if you are thinking about publishing outcomes data. Using ChatGPT for summarizing and generating insights can create authorship issues, open you up to plagiarism and potentially infringe copyright. The scholarly publishing community has already issued statements that either prohibit the inclusion of AI-generated text in submitted work or discourage the submission of content created by AI technologies. Their rationale is that such content undermines the principle of integrity and muddies the definition of authorship (which, at heart, is based on transparency and accountability).
Proceed With Caution
There might be potential for using ChatGPT in qualitative outcomes evaluation, but we should proceed with caution and assiduously review the implications of this technology. We can do this by considering how to best verify and evaluate ChatGPT-generated insights, how to recognize and mitigate biases in those insights, and how to ensure content integrity.
A former trauma operating room nurse and academic, Alexandra Howson PhD, CHCP, has contributed to CME/CPD as a writer, educator and podcaster since 2010. A frequent presenter at the Alliance Annual Conference, Alex was chair of the Alliance Research Committee 2018–2021 and served as faculty for the CHCP prep course in 2020. She teaches the Fundamentals of Medical Writing Ethics on the Professional Medical Writing Certificate program at the University of Chicago and provides specialist CME/CPD training and professional development for medical writers. Alex is host of Write Medicine, a weekly podcast that explores best practices in creating education content for health professionals.