Special issue section on Infodemiology and Infodemic Management

NLP and Deep Learning Methods for Curbing the Spread of Misinformation in India

Pages 216-227 | Received 12 Oct 2020, Accepted 27 Aug 2021, Published online: 16 Nov 2021

ABSTRACT

The current fight against COVID-19 is not only about its prevention and cure but also about mitigating the negative impact of the misinformation that surrounds it. The pervasiveness of social media and access to smartphones have propelled the spread of misinformation on such a large scale that the World Economic Forum considers it one of the main threats to our society. This ‘Infodemic’ has caused widespread rumors, fueled practices that can jeopardize one’s health, and has even resulted in hate violence in certain parts of the world. We built an engine that matches incoming text, which may contain correct or incorrect information, against a known repository of misinformation. By matching texts on embeddings generated using BERT, we evaluated paraphrased texts to see if they matched texts previously labeled as misinformation. Further, we augmented an existing data corpus by tagging each piece of misinformation with one or more impact categories. If we can predict the particular ramification of a certain type of misinformation, we may be able to take specific actions to avert its consequences.

Introduction

December 2019 saw the emergence of a new viral pathogen, now called SARS-CoV-2, which causes the disease COVID-19. The spread of COVID-19 increased exponentially worldwide in March 2020, and the outbreak was subsequently categorized as a pandemic by the World Health Organization (WHO, Citation2020b). Alongside the spread of the disease itself occurred an equally rampant spread of unverified and incorrect information surrounding the pandemic, which health regulatory agencies have since referred to as an “Infodemic” (Dong & Bouey, Citation2020; WHO, Citation2020a). Despite the pervasiveness of misinformation, studies looking at its prevalence are infrequent. Kreps and coworkers examined the uptake of misinformation in the categories of origin, treatment, and government response to the disease, and found that a third of the study participants had come across misinformation about COVID-19 (Kreps & Kriner, Citation2020). India in particular has been susceptible to the spread of online misinformation, with 687 million of its 1.38 billion people able to access the internet and online news from mobile phones, and 400 million using the popular social networking platform Facebook and its messaging service WhatsApp (Banerjee & Haque, Citation2018). It is not only the general public that is often faced with misinformation: in a survey of over 700 Indian health care professionals, 47.2% reported encountering misinformation on social media and 26.7% via their family and friends (Datta, Yadav, Singh, Datta, & Bansal, Citation2020). India has thus been fighting two pandemics simultaneously: COVID-19 and the equally dangerous misinformation that surrounds it. There is a dearth of studies that examine categories of COVID-19 misinformation, such as its economic, communal, and social angles. One barrier to rigorous misinformation research is that there have been no efforts to categorize misinformation across a wide range of categories.
This present effort was undertaken to map misinformation to predefined categories in order to:

  1. Determine the prevalence of misinformation by category of impact, and

  2. Create an open-access curated corpus that can be reused by other investigators.

Figure 1. Top 20 unigrams among false news.

Figure 2. Top 20 bigrams among false news.

Figure 3. Top 20 trigrams among false news.

Figure 4. Word cloud of most popular words among false news.

Related work

The need for Natural Language Processing and automatic fact-checkers in detecting misinformation

Automatic fake news detection has become a Natural Language Processing (NLP) task useful to all online content providers, as it reduces the human effort required to prevent the spread of misinformation (Oshikawa, Qian, & Wang, Citation2020). Fact-checking by humans is an intellectually demanding process: it takes about one day to research and write an article to debunk a claim, leaving many harmful claims unchallenged, especially at the local level. This presents an immediate need for automatically detecting and labeling such news and posts (Hassan, Arslan, Li, & Tremayne, Citation2017). However, automated detection of misinformation is a hard task, as it requires the model to understand nuances in natural language. NLP models have made tremendous progress in the automatic detection of sentiment and in mining opinions from text, and the application of NLP to the detection of misinformation has been explored by many (Thota, Tilak, Ahluwalia, & Lohia, Citation2018).

Our approach

We take an alternative approach to detecting misinformation: finding the item in our repository most similar to the text being evaluated. Our repository contains misinformation instances that are already confirmed to be incorrect information. This enables users to access the broader context of misinformation around the text in question. The task is performed by combining text-similarity evaluation with automated paraphrasing.

Text similarity

Text similarity has recently been explored extensively. Reimers and Gurevych (Citation2019) propose Sentence-BERT, which builds on pre-trained BERT and uses a Siamese network to produce sentence embeddings. These embeddings can be compared using cosine similarity to find the most similar pairs of texts. The paper also reports accuracy on par with BERT/RoBERTa while reducing the runtime for finding the most similar sentence pair from 65 hours to 5 seconds. In another paper (Cer et al., Citation2018), easy-to-use encoding models that generate sentence-level embeddings are presented and shown to perform well on various NLP problems.
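Concretely, once two sentences have been mapped to embedding vectors $u$ and $v$, the comparison used by these approaches is the cosine of the angle between the vectors:

```latex
\operatorname{sim}(u, v) = \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \in [-1, 1]
```

Values near 1 indicate near-identical meaning under the embedding model, so finding the most similar pair reduces to a nearest-neighbor search over these scores.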

Paraphrasing

The next problem we explored is automating the paraphrasing of previously identified misinformation, both to expand the dataset and to test our text-similarity matching algorithm. A piece of misinformation can be presented in various ways; hence, we paraphrased the claims in the dataset and tested whether our similarity checker could match a text in question to one in a set of paraphrases. Fu and colleagues (Fu, Feng, & Cunningham, Citation2019) proposed a paraphrase generation model based on a latent bag of words (BOW): the method finds the neighbors of a given word, models the target BOW, and then performs subset sampling on the predicted BOW distribution using Gumbel top-k reparameterization. In another paper (Li, Jiang, Shang, & Li, Citation2018), a deep reinforcement learning approach to paraphrase generation is presented, consisting of a generator and an evaluator. The generator, a Seq2Seq learning model, paraphrases sentences; the evaluator, constructed using deep matching, checks whether two given sentences are paraphrases and is employed to fine-tune the generator. Li and colleagues also presented a Transformer-based Decomposable Neural Paraphrase Generator (Li, Jiang, Shang, & Liu, Citation2019), which aims to generate paraphrases of a sentence at multiple levels of granularity in a disentangled manner. The model uses multiple encoders and decoders with different structures, each corresponding to a particular granularity.

Misinformation repository

The primary dataset of misinformation we used is the FakeCovid dataset, which contains 7623 fact-checked COVID-19-related news articles (Shahi & Nandini, Citation2020). The dataset comprises fact-checked articles from 92 sites, each labeled with the accuracy of the claim, and covers misinformation from 105 countries in 45 languages. For the purposes of our project, we used only misinformation that originated in India and was in English; the dataset contains 731 such instances.

Analysis of dataset

The top 20 n-grams (see Figures 1–3) for the dataset of Indian misinformation show how the misinformation is spread, what format it takes, and which aspects of the pandemic it concerns. The top 20 unigrams suggest that much of the misinformation refers to the government, the police, and the lockdown; to the countries of China and Italy, which had the most coronavirus cases initially; or to Muslims. The bigrams show that most of the misinformation is spread via videos and images on Facebook and Twitter. A substantial amount of misinformation also concerns India’s Prime Minister, and the state of Uttar Pradesh appears to be a hotspot for misinformation. The findings are similar for the trigrams: much of this news originates from social media.
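The frequency analysis above can be sketched with a simple counter. The tokenizer, stopword list, and claim texts below are illustrative stand-ins, not the actual preprocessing pipeline or dataset:

```python
from collections import Counter
import re

def top_ngrams(texts, n, k=20, stopwords=frozenset({"the", "a", "of", "in", "to", "is", "and", "was"})):
    """Return the k most frequent word n-grams across a list of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, keep alphabetic tokens, drop stopwords.
        tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stopwords]
        # Slide a window of length n over the token list.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Toy claims standing in for the 731 India-originated misinformation texts.
claims = [
    "video claims lockdown extended by the government",
    "old video claims lockdown imposed by police",
]
print(top_ngrams(claims, n=2, k=3))
```

The same function with `n=1` and `n=3` yields the unigram and trigram counts behind figures of this kind.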

From the word cloud (Figure 4), it appears that a substantial amount of misinformation also surrounds the vaccine and possible treatments for the coronavirus. In addition to this dataset, we created a subset of 131 India-based misinformation instances (typically a sentence or two) in English by manually paraphrasing misinformation from the FakeCovid dataset. We annotated the subset by tagging each instance with one or more impact categories (Cures/Treatment, Political Impact, Economic Impact, Increasing Stigma, Evoking Fear, Communal Impact, Susceptibility, Misrepresentation of Public Figures or Government). The paraphrases are of two main types, subtle and obvious. We will make this dataset publicly available with the release of this paper.

Method

To make our model robust to different ways of expressing the same information, we used both automated and manual methods to paraphrase texts in the primary dataset, employing three different automated methods. In the first method, Google Translate was used to translate the English text into a different language (Thai for this experiment); the Thai sentences were then translated back into English, yielding sentences phrased differently from the originals. The second method used BART, an encoder-decoder Transformer considered state of the art for language generation tasks, to create representations of texts, which were then coupled with a language generation module to create up to three paraphrases for each text in the original dataset. BART provided more extensive paraphrasing of the texts than the translation method. Finally, we employed the Text-to-Text Transfer Transformer (T5) (Raffel et al., Citation2020), a pre-trained model useful for many NLP tasks such as summarization and paraphrasing, to generate paraphrases in a similar fashion: T5 created the sentence representation, after which its language generation module created up to two paraphrases for each text in the original dataset.
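The back-translation step can be sketched as a round trip through a pivot language. A toy word-substitution "translator" stands in here for Google Translate and the Thai pivot actually used; the vocabulary is invented purely for illustration:

```python
# Toy pivot "dictionaries" stand in for a real translation service.
EN_TO_PIVOT = {"lockdown": "X1", "extended": "X2", "nationwide": "X3"}
PIVOT_TO_EN = {"X1": "curfew", "X2": "prolonged", "X3": "nationwide"}  # deliberately lossy

def back_translate(sentence):
    """Round-trip a sentence through the pivot vocabulary; unknown words pass through unchanged."""
    pivot = [EN_TO_PIVOT.get(w, w) for w in sentence.split()]
    return " ".join(PIVOT_TO_EN.get(w, w) for w in pivot)

print(back_translate("nationwide lockdown extended"))
```

Because translation is lossy in both directions, the round trip naturally produces a rewording of the original sentence, which is exactly the effect exploited for paraphrase generation.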

The similarity checker consists of a module that creates embeddings of the two input sentences and reports the cosine similarity between them; we used BERT to generate the embeddings. If the cosine similarity is above a threshold of 0.80 (tuned to empirical data), we classify the two sentences as a match. To evaluate its accuracy, the similarity checker is called with the paraphrases generated above and the corresponding source sentences.
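A minimal sketch of this matching rule follows; the toy vectors stand in for real BERT sentence embeddings, whose generation is omitted here:

```python
import numpy as np

THRESHOLD = 0.80  # tuned to empirical data, per the method above

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_match(emb_a, emb_b, threshold=THRESHOLD):
    """Classify two texts as a match when their embeddings' cosine similarity clears the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy vectors standing in for BERT embeddings.
a = np.array([0.9, 0.1, 0.2])
b = np.array([0.85, 0.15, 0.25])  # near-paraphrase of a
c = np.array([0.1, 0.9, -0.3])    # unrelated text
print(is_match(a, b), is_match(a, c))
```

In the full system, `a` would be the embedding of the text in question and `b`, `c` embeddings of repository entries, so a match surfaces the known misinformation most similar to the input.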

Results

Automated paraphrasing

The automated paraphrasing exercise was conducted using three approaches: T5, BART, and Google Translate. Based on our observations, T5 generates complex but unpredictable paraphrases, sometimes to the extent that it alters the meaning of the original text or adds new information to it. For instance, “Claim that photos show dogs being killed in China as part of efforts to combat COVID-19.” was paraphrased to “Photos of dogs were destroyed in China due to COVID-19. Is it true that the animals were killed by the Chinese on November 4, 1991?” BART, on the other hand, produced simpler paraphrases than T5, performed through a combination of operations such as substituting words and phrases with their synonyms (e.g., ‘claims’ and ‘suggests’), joining and splitting sentences, adding words and phrases, changing the tense, and reordering parts of the sentences. We observed that these changes generally do not alter the meaning of the text. Lastly, Google Translate also produced simple paraphrases. Most of them appear correct and are generally based on adding, deleting, or substituting words and phrases, changing the tense, and reordering parts of the sentence; it does not generate the out-of-the-box paraphrases that might appear in a real-world dataset. Examples of paraphrases from each method are included in Table 1.

Table 1. Sample paraphrases for different paraphrasing algorithms

Manual paraphrasing

The automated paraphrases are generated by AI and might differ from human-generated paraphrases. Therefore, we created a small dataset of 131 manually generated paraphrases for validating the text-similarity algorithm on paraphrases that were not produced automatically.

Text similarity

We compared the misinformation texts with their paraphrases by evaluating the cosine similarity between the embeddings of the texts and the embeddings of their paraphrases. The matching algorithm identifies most of the manually generated paraphrases as well as the paraphrases generated automatically by BART and Google Translate. For T5, the matching algorithm does not perform as well, which might be partly attributed to the fact that T5 alters the meaning of the text while paraphrasing. The results of this exercise are included in Table 2. In addition, we performed an experiment to distinguish between similar and dissimilar text pairs. The similar text pairs are the 131 misinformation texts and their corresponding manually generated paraphrases; the dissimilar pairs are formed by pairing two different misinformation texts from our dataset of 131 manual paraphrases. For matching dissimilar texts, we performed 100 iterations, each matching 131 different pairings of misinformation texts, and report the worst-case results in Table 3. As Table 3 shows, the matching algorithm performed well on the manually generated dataset. We also observe, in Tables 2 and 3, that the results of the matching algorithm on automated paraphrasing methods such as BART and Google Translate are similar to those on manual paraphrasing. We plan to test the matching algorithm on a larger dataset of manual paraphrases.
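The evaluation protocol above can be sketched as two small metrics: the match rate over similar (paraphrase) pairs, and the worst-case false-match rate over repeated random pairings of dissimilar texts. The similarity scores here are invented stand-ins for real text-pair cosine similarities:

```python
def match_rate(scores, threshold=0.80):
    """Fraction of text pairs classified as a match at the given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def worst_case_mismatch(iterations, threshold=0.80):
    """Highest false-match rate observed across iterations of random dissimilar pairings."""
    return max(match_rate(it, threshold) for it in iterations)

# Toy cosine scores: four paraphrase pairs (one near-miss) and two
# iterations of random dissimilar pairings (one spurious high score).
similar = [0.95, 0.88, 0.91, 0.79]
dissimilar_runs = [[0.21, 0.35, 0.10], [0.15, 0.83, 0.05]]

print(match_rate(similar))                 # true-match rate on paraphrase pairs
print(worst_case_mismatch(dissimilar_runs))  # worst-case false-match rate
```

In the actual experiment, `similar` would hold 131 scores and `dissimilar_runs` would hold 100 iterations of 131 scores each, with the worst case reported as in Table 3.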

Table 2. Performance of paraphrasing techniques

Table 3. Performance of matching algorithm on manual paraphrasing

Discussion

COVID-19 has brought to the fore a complex problem arising from the ubiquity of unverified and potentially harmful information. Although there have been many initiatives to curb the spread of misinformation, it is a very difficult task because there is no single truth value to compare against. We are in a constant state of flux, and advisories on topics like wearing masks keep changing as the world learns more about COVID-19 (Peeples, Citation2020). Hence, fact-checking alone might not be enough; a better approach could be to educate people by providing more context around the information being searched. Second, as misinformation propagates, it gets altered by intermediary links, which calls for a robust fact-checker that matches text semantically. Finally, for countries like India, the propagation of misinformation might introduce linguistic errors, such as the grammar and spelling mistakes common among non-native speakers of English, making fact-checking even more difficult.

Generating paraphrases manually is an expensive and time-intensive process. We explored different methods of automating paraphrasing, using Google Translate and algorithms like BART, and our results are comparable to manual paraphrasing. We also show that our text-matching algorithm, which compares texts by evaluating the cosine similarity between their BERT-generated embeddings, performs well at identifying similar and dissimilar information for both automated and manual paraphrases.

Finally, we have augmented the FakeCovid dataset by adding paraphrases and impact categories. This will assist with broad-level categorization of misinformation and will improve the search capabilities of fact-checking engines. It will also help in creating, and later automating, the pipeline that traces misinformation from its source for those who can play a vital role in curbing its spread, from policymakers to healthcare workers. We will be releasing a fact-checking website based on this work to the public with the release of this paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Amber Nigam

Amber Nigam Co-founder of HealthDataRecipe.ai; incoming student for MS in Health Data Science, Harvard TH Chan School of Public Health, Harvard University. Amber Nigam has a bachelor’s degree in Computer Science and is an incoming student at HSPH, Harvard University for the Health Data Science program. He has 9 years of industry and entrepreneurial experience in the domain of data science. His interest lies at the intersection of data science, NLP, health, and education. In particular, he is interested in exploring the semantic analysis of natural language content and precision medicine.

Pragati Jaiswal

Pragati Jaiswal MPH, Harvard TH Chan School of Public Health, Harvard University. Pragati is a Senior Operations Manager at mClinica, a health tech startup based out of Singapore. She received her MPH from Harvard University. She has a background in pharmacy and business. She has a combined experience of more than five years of working in operations and business consulting in healthcare. She likes to solve problems in the social impact space and in her spare time focuses on interdisciplinary research projects usually centered on data science and public health.

Saketh Sundar

Saketh Sundar Research Intern at MIT. Saketh is a student at River Hill High School, Maryland and currently works as a research intern at MIT. His research interests include applications of machine learning and artificial intelligence in public health and medicine. He has been working on various research projects at MIT including forecasting Covid-19 deaths and excess mortality analysis.

Mukund Poddar

Mukund Poddar MS in Health Data Science, Harvard TH Chan School of Public Health, Harvard University. Mukund did his undergraduate in Computer Science and Engineering, and is currently seeking to learn more about the healthcare industry. He wants to help enhance primary care and clinical outcomes using his knowledge of Machine Learning and Data Science.

Nitya Kumar

Nitya Kumar Lecturer in Public Health and Epidemiology at the Royal College of Surgeons in Ireland. Nitya Kumar is teaching faculty at the Royal College of Surgeons in Ireland at their Bahrain campus. In addition to teaching Public Health Epidemiology, Evidence-Based Medicine, and Biostatistics to students of Medicine and Nursing, she is the Statistical Advisor to the Research Ethics Committee at RCSI Bahrain. Nitya’s experience lies in quantitative public health research spanning India, Bahrain, Malawi, and the United States, and she is currently a leading member of several institutional and extramural research projects. She mentors senior students of medicine in research projects focused on the analysis of datasets from population-based cohorts. Nitya received her doctorate in Public Health Nutrition from Maharaja Sayajirao University of Baroda in 2015, and following that she received her second Masters in Epidemiology from Harvard University in 2018.

Franck Dernoncourt

Franck Dernoncourt Research Scientist at Adobe. Franck is a researcher at Adobe Research in San Jose. He received his PhD in machine learning from MIT. His research interests include neural networks and natural language processing. He has worked on several NLP applications such as text classification, sequential labeling, named-entity recognition, question-answering systems, summarization, dialog systems, image editing by voice, and speech recognition.

Leo A. Celi

Leo A. Celi Principal Research Scientist, Massachusetts Institute of Technology, Clinical Research Director, Laboratory of Computational Physiology. Co-Director, MIT Sana Associate Professor of Medicine, Part-time, Harvard Medical School. Leo Anthony Celi has practiced medicine in three continents, giving him broad perspectives in healthcare delivery. As clinical research director and principal research scientist at the MIT Laboratory of Computational Physiology (LCP), he brings together clinicians and data scientists to support research using data routinely collected in the intensive care unit (ICU). His group built and maintains the Medical Information Mart for Intensive Care (MIMIC) database. This public-access database has been meticulously de-identified and is freely shared online with the research community.

References

  • Banerjee, A., & Haque, M. N. (2018). Is fake news real in India? Journal of Content, Community and Communication, 4(8), 46–49. doi:https://doi.org/10.31620/JCCC.12.18/09
  • Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., … Sung, Y. H. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
  • Datta, R., Yadav, A. K., Singh, A., Datta, K., & Bansal, A. (2020). The infodemics of COVID-19 amongst healthcare professionals in India. Medical Journal, Armed Forces India, 76(3), 276–283. doi:https://doi.org/10.1016/j.mjafi.2020.05.009
  • Del Vicario, M., Bessi, A., Zollo, F., Petroni, F., Scala, A., Caldarelli, G., … Quattrociocchi, W. (2016). The spreading of misinformation online. Proceedings of the National Academy of Sciences, 113(3), 554–559. doi:https://doi.org/10.1073/pnas.1517441113
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dong, L., & Bouey, J. (2020). Public mental health crisis during COVID-19 pandemic, China. Emerging Infectious Diseases, 26(7), 1616–1618. doi:https://doi.org/10.3201/eid2607.200407
  • Fu, Y., Feng, Y., & Cunningham, J. P. (2019). Paraphrase generation with latent bag of words. Advances in Neural Information Processing Systems (pp. 13645–13656).
  • Hassan, N., Arslan, F., Li, C., & Tremayne, M. (2017). Toward automated fact-checking: Detecting check-worthy factual claims by claimbuster. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1803–1812. doi: https://doi.org/10.1145/3097983.3098131
  • Kreps, S., & Kriner, D. (2020, June 10). Medical misinformation in the Covid-19 pandemic. Retrieved from osf.io/jbgk9
  • Li, Z., Jiang, X., Shang, L., & Li, H. (2018). Paraphrase generation with deep reinforcement learning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3865–3878).
  • Li, Z., Jiang, X., Shang, L., & Liu, Q. (2019, July). Decomposable neural paraphrase generation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3403–3414).
  • Oshikawa, R., Qian, J., & Wang, W. Y. (2020). A survey on natural language processing for fake news detection. Proceedings of the 12th Language Resources and Evaluation Conference (pp. 6086–6093). Retrieved from https://www.aclweb.org/anthology/2020.lrec-1.747
  • Peeples, L. (2020, October 06). Face masks: What the data say. Retrieved from https://www.nature.com/articles/d41586-020-02801-8
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67.
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
  • Shahi, G., & Nandini, D. (2020). FakeCovid- A multilingual cross-domain fact check news dataset for COVID-19. Proceedings of the 14th International Conference on Web and Social Media. 14th International Conference on Web and Social Media. Retrieved from http://workshop-proceedings.icwsm.org/pdf/2020_14.pdf
  • Tasnim, S., Hossain, M. M., & Mazumder, H. (2020). Impact of rumors and misinformation on COVID-19 in social media. Journal of Preventive Medicine and Public Health, 53(3), 171–174. doi:https://doi.org/10.3961/jpmph.20.094
  • Thota, A., Tilak, P., Ahluwalia, S., & Lohia, N. (2018). Fake news detection: A deep learning approach. SMU Data Science Review, 1(3), 21.
  • World Health Organization. (2020a). 2019 novel coronavirus (2019-nCoV): Strategic preparedness and response plan. Author.
  • World Health Organization. (2020b). Coronavirus disease (COVID-19): Events as they happen. Retrieved from https://www.who.int/emergencies/diseases/novel-coronavirus-2019/events-as-they-happen