Abstract
In this study, we investigate the grammar of fake news by bringing together insights from corpus linguistics and machine learning. The former offers a robust corpus-based method for the register analysis of grammatical features, namely multidimensional analysis (Biber, 1988); the latter contributes methods for automatically identifying fake news from those features. Fake news detection has made remarkable progress in natural language processing and machine learning (e.g., Rashkin et al., 2017; Põldvere et al., 2023), but it has not taken full advantage of the linguistic resources that are available. Based on the new PolitiFact-Oslo Corpus (Põldvere et al., 2023), we aim (i) to describe the grammatical differences between fake and real news across a variety of text types in a large corpus, and (ii) to develop an efficient deep learning approach to fake news detection based on these differences.
A common distinction in multidimensional register analysis is between informational and involved styles of communication. The former tends to contain more nouns and is common in registers with dense styles of communication such as news reportage, whereas the latter is characterized by more frequent use of pronouns, verbs and adjectives and is common in spontaneous conversation with lower levels of information density. Departing from the view that fake news is a register in its own right, Grieve and Woodfield (2023) analyzed 49 grammatical features in a small collection of fake and real news texts by one journalist. They found fake news to be more similar to involved styles of communication through the use of, e.g., present tense verbs, emphatics and predicative adjectives. Real news, by contrast, shared features with informational styles of communication.
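In multidimensional analysis, the informational/involved contrast is typically quantified as a dimension score: feature counts are normalized per 1,000 words, standardized against the corpus mean and standard deviation, and the z-scores of features on one pole are summed while those on the opposite pole are subtracted. The following is a minimal sketch of that calculation. The feature names, their grouping into poles and the counts are illustrative assumptions, not values taken from the PolitiFact-Oslo Corpus or from Grieve and Woodfield (2023).

```python
from statistics import mean, pstdev

# Hypothetical feature groupings, chosen only for illustration.
INVOLVED = ("present_tense", "emphatics", "pronouns")
INFORMATIONAL = ("nouns", "prepositions")


def per_thousand(counts, n_words):
    """Normalize raw feature counts to a rate per 1,000 words."""
    return {f: 1000.0 * c / n_words for f, c in counts.items()}


def dimension_scores(texts):
    """Return one involved-vs-informational score per text.

    Each rate is z-scored against the corpus; the score is the sum of the
    involved z-scores minus the sum of the informational z-scores.
    """
    rates = [per_thousand(t["counts"], t["n_words"]) for t in texts]
    features = INVOLVED + INFORMATIONAL
    mu = {f: mean([r[f] for r in rates]) for f in features}
    # Fall back to 1.0 when a feature does not vary, to avoid dividing by zero.
    sd = {f: pstdev([r[f] for r in rates]) or 1.0 for f in features}
    scores = []
    for r in rates:
        z = {f: (r[f] - mu[f]) / sd[f] for f in features}
        involved = sum(z[f] for f in INVOLVED)
        informational = sum(z[f] for f in INFORMATIONAL)
        scores.append(involved - informational)
    return scores


# Toy corpus: text 0 is pronoun- and verb-heavy, text 1 is noun-heavy.
toy_corpus = [
    {"n_words": 200, "counts": {"present_tense": 20, "emphatics": 4,
                                "pronouns": 24, "nouns": 40, "prepositions": 16}},
    {"n_words": 500, "counts": {"present_tense": 20, "emphatics": 2,
                                "pronouns": 15, "nouns": 160, "prepositions": 70}},
]
scores = dimension_scores(toy_corpus)
```

Under this sign convention, a higher score indicates a more involved style and a lower score a more informational one, so the first toy text scores above the second.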
In contrast to Grieve and Woodfield (2023), in this study we make use of a large corpus of fake and real news in English: the PolitiFact-Oslo Corpus. The main strengths of the corpus are that the texts have been individually labelled for veracity by experts and are accompanied by important metadata about the text types (e.g., social media, news and blog) and sources (e.g., X, The Gateway Pundit). At present, the corpus contains 428,917 words of fake and real news, and it is growing. To extract the grammatical features, we used the Multidimensional Analysis Tagger (Nini, 2019). We then trained an efficient deep learning model (attention-based Long Short-Term Memory; LSTM) on the features discriminating fake from real news. The trained model was then used to automatically detect the fake news texts.
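The abstract does not specify the model architecture beyond "attention-based LSTM". As a generic illustration of the attention step only, the sketch below pools a sequence of hidden-state vectors into a single vector: each state is scored against a context vector, the scores are turned into weights with a softmax, and the weighted states are summed. The hidden states and the context vector are toy numbers standing in for LSTM outputs and a learned parameter; both are assumptions and not part of the study.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Return (weights, pooled) for dot-product attention pooling.

    hidden_states: list of equal-length vectors, one per time step
                   (in a real model these would come from the LSTM).
    context:       vector of the same length (learned in a real model).
    """
    # Score each time step by its dot product with the context vector.
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(context)
    # Weighted sum of the hidden states, computed dimension by dimension.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


# Toy input: the middle time step aligns most strongly with the context.
hidden_states = [[0.1, 0.0], [0.9, 0.2], [0.1, 0.0]]
context = [2.0, 0.0]
weights, pooled = attention_pool(hidden_states, context)
```

In a full classifier, the pooled vector would be passed to an output layer that predicts the fake or real label. The weights show which positions in the input contributed most to that vector.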
The preliminary results, based on a sample from the corpus, indicate that there are systematic differences between fake and real news, which by and large reflect the distinction between involved and informational styles of communication, respectively. However, these differences are not the same across the text types: fake news shows lower levels of information density in social media than in the news and blog text types. Our machine learning model based on the grammatical features also shows promising results (LSTM mean accuracy: 90%), particularly when compared to models without the grammatical features.