Abstract
Fake news is a topic that has only recently caught the attention of (corpus) linguists (Grieve & Woodfield, 2023; Sousa Silva, 2022; Trnavac & Põldvere, 2024). Such research has sought to identify differences in linguistic features between fake and real news on the basis of carefully designed corpora. An example of such a corpus is the new PolitiFact-Oslo Corpus (Põldvere et al., 2023), a large dataset of fake and real news in English based on recent events (post-2019). However, in its current form the corpus has some limitations, owing to the highly specific and sensitive nature of fake news data. The present methodological study seeks solutions to these limitations, with a view to facilitating future corpus-building efforts around fake news, a highly promising area of study for linguists.
As the name implies, the PolitiFact-Oslo Corpus relies on the fact-checking website PolitiFact.com for its data, with each news item individually labelled for veracity by experts (from ‘True’ to ‘Pants on Fire’). In contrast to many other fake news datasets (e.g., DeClarE in Popat et al., 2018), the corpus was built through a combination of automatic and manual procedures, giving greater control over what is included. In addition to the manual approach to text selection, the corpus is accompanied by important metadata about the texts, such as their text type (e.g., social media) and source (e.g., X). That said, the corpus currently has two major limitations. Firstly, there is a noticeable imbalance between the fake and real news samples (358,516 vs. 70,401 words, respectively), which stems from the preference of PolitiFact and other fact-checkers for debunking false information rather than confirming true information. This imbalance has serious implications for fake news analysis and for detection models developed on the basis of the corpus (Põldvere et al., 2023). Secondly, owing to copyright and privacy issues the corpus is not currently publicly available, a restriction hardly in line with current open science practices.
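To make this design concrete, the following is a minimal Python sketch of how a single corpus item and its accompanying metadata might be represented. The class and field names are our own illustration, not the corpus's actual schema; only the six-point PolitiFact rating scale is taken from the source.

```python
from dataclasses import dataclass
from enum import Enum


class Veracity(Enum):
    """PolitiFact's six-point rating scale, from most to least accurate."""
    TRUE = "true"
    MOSTLY_TRUE = "mostly-true"
    HALF_TRUE = "half-true"
    MOSTLY_FALSE = "mostly-false"
    FALSE = "false"
    PANTS_ON_FIRE = "pants-fire"


@dataclass
class CorpusItem:
    """One news item paired with its expert rating and metadata (illustrative)."""
    text: str            # the full news text (or a snippet, if released publicly)
    rating: Veracity     # expert-assigned veracity label
    text_type: str       # e.g. "social media", "news article"
    source: str          # e.g. "X", "Facebook"
    fact_check_url: str  # link to the fact-check article that rated the item
```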
We offer some solutions. To address the imbalance between the fake and real news samples, we have decided to extend the range of fact-checkers drawn on rather than stretch out the timeline. Additional fact-checkers are identified via Google’s Fact Check Explorer, which provides quick and easy access to more instances of (mostly or half) true news. The challenge is to ensure comparability of ratings across fact-checkers (what is ‘Mostly True’ according to one fact-checker may be ‘Half True’ according to another), as well as balance in terms of the metadata (text type, source). The lack of access to the corpus is a much more complex problem to solve. Inspired by current practices in corpus linguistics, we are exploring opportunities to release text snippets, rather than full texts, via an online interface; this, however, is complicated by the legal challenges of distributing fake news data in our national context. We seek solutions to these challenges, too.
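To illustrate the retrieval step, below is a minimal Python sketch built on the Fact Check Tools API that underlies Google’s Fact Check Explorer. The endpoint and response fields follow the public API, but the API key is a placeholder, and the rating crosswalk is an illustrative assumption rather than a validated mapping between fact-checkers’ scales.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; the Fact Check Tools API requires a key
ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

# Illustrative crosswalk from native textual ratings to a common scale.
# These equivalences are assumptions; real mappings would need to be
# validated against each fact-checker's published rating criteria.
CROSSWALK = {
    "True": "true",
    "Mostly True": "mostly-true",
    "Half True": "half-true",
    "Mixture": "half-true",  # assumed equivalence for a Snopes-style label
}


def search_claims(query: str, language: str = "en", page_size: int = 20) -> list[dict]:
    """Return claim reviews matching `query`, with ratings mapped where possible."""
    params = {"query": query, "languageCode": language,
              "pageSize": page_size, "key": API_KEY}
    response = requests.get(ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    results = []
    for claim in response.json().get("claims", []):
        for review in claim.get("claimReview", []):
            native = review.get("textualRating", "")
            results.append({
                "claim": claim.get("text"),
                "fact_checker": review.get("publisher", {}).get("site"),
                "native_rating": native,
                "common_rating": CROSSWALK.get(native),  # None if unmapped
                "url": review.get("url"),
            })
    return results
```

Ratings with no mapping come back as None rather than being dropped, so that candidate equivalences (e.g., ‘Mixture’ vs. ‘Half True’) can be adjudicated manually before a text is included in the corpus.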