The PolitiFact-Oslo Corpus: A New Dataset for Fake News Analysis and Detection

Abstract

This study presents a new dataset for fake news analysis and detection, namely, the PolitiFact-Oslo Corpus. The corpus contains samples of both fake and real news in English, collected from the fact-checking website PolitiFact.com. It grew out of a need for a more controlled and effective dataset, based on recent events, for fake news analysis and for the development of detection models. Three features make it uniquely placed for this: (i) the texts have been individually labelled for veracity by experts, (ii) they are complete texts that strictly correspond to the claims in question, and (iii) they are accompanied by important metadata such as text type (e.g., social media, news, and blogs). We also present a pipeline for collecting quality data from major fact-checking websites, a procedure that can be replicated in future corpus-building efforts. An exploratory analysis based on sentiment and part-of-speech information reveals interesting differences between fake and real news as well as between text types, thus highlighting the importance of adding contextual information to fake news corpora. Since the main application of the PolitiFact-Oslo Corpus is in automatic fake news detection, we critically examine the applicability of the corpus, and of another PolitiFact dataset built using less strict criteria, to various efficient deep learning-based approaches, including Bidirectional Long Short-Term Memory (Bi-LSTM), LSTM fine-tuned transformers such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa, and XLNet.

Client

Research Council of Norway (RCN) / 302573

Language

English

Author(s)

Nele Põldvere
Md Zia Uddin
Aleena Thomas

Affiliation

University of Oslo
SINTEF Digital / Sustainable Communication Technologies

Year

2023

Published in

Information

ISSN

2078-2489

Publisher

MDPI

External resources

https://www.hf.uio.no/ilos/english/research/projects/fakespeak/

DOI

https://doi.org/10.3390/info14120627

Read fulltext

https://hdl.handle.net/10852/106230
