Click here to flash read.
This article presents a comparative analysis of the ability of two large
language model (LLM)-based chatbots, ChatGPT and Bing Chat, recently rebranded
to Microsoft Copilot, to detect veracity of political information. We use AI
auditing methodology to investigate how chatbots evaluate true, false, and
borderline statements on five topics: COVID-19, Russian aggression against
Ukraine, the Holocaust, climate change, and LGBTQ+ related debates. We compare
how the chatbots perform in high- and low-resource languages by using prompts
in English, Russian, and Ukrainian. Furthermore, we explore the ability of
chatbots to evaluate statements according to political communication concepts
of disinformation, misinformation, and conspiracy theory, using
definition-oriented prompts. We also systematically test how such evaluations
are influenced by source bias which we model by attributing specific claims to
various political and social actors. The results show high performance of
ChatGPT for the baseline veracity evaluation task, with 72 percent of the cases
evaluated correctly on average across languages without pre-training. Bing Chat
performed worse with a 67 percent accuracy. We observe significant disparities
in how chatbots evaluate prompts in high- and low-resource languages and how
they adapt their evaluations to political communication concepts with ChatGPT
providing more nuanced outputs than Bing Chat. Finally, we find that for some
veracity detection-related tasks, the performance of chatbots varied depending
on the topic of the statement or the source to which it is attributed. These
findings highlight the potential of LLM-based chatbots in tackling different
forms of false information in online environments, but also points to the
substantial variation in terms of how such potential is realized due to
specific factors, such as language of the prompt or the topic.
No creative common's license