Vaccine Misinformation in Twitter Data

December 13, 2021

While the past few years have seen mind-blowing gains in what AI can accomplish with natural language processing,  properly categorizing unstructured language that contains humor, sarcasm, wordplay, and irony remains a challenge.  This challenge grows exponentially when the use case requires domain expertise and judgment from human analysts, as scaling this expertise across large, unstructured text datasets is usually neither affordable nor realistic.

Harvard University researchers and PhD candidates Soubhik Barari and Sophie Hill and their advisor, Harvard professor Gary King, know this challenge well.  They are tackling one of the biggest social problems facing the world today: combatting health misinformation on social media, specifically on Twitter.  

Recognizing that manually reviewing and labeling a dataset of 300,000 Twitter posts was not the most efficient use of their time, the Harvard researchers turned to QuickCode for help in scaling their expertise and sorting through their massive pile of vaccine-related Twitter posts.  One of their first tasks was to characterize the subtle - and not-so-subtle - vocabulary adopted by spreaders of anti-vaccine misinformation.  In machine learning, in order to identify some phenomenon well, you also have to identify everything that appears to be, but actually isn't, that phenomenon.

QuickCode, with its human-in-the-loop machine learning technique, helped the team separate vaccine misinformation language from scientifically corroborated information about the vaccine.  For example, phrases such as "experimental vaccine"[1] and praise for Dr. Zev Zelenko[2] are almost exclusively used by individuals spreading vaccine misinformation.

By contrast, phrases like "anti-vax" and "skeptic", while topically aligned with vaccine misinformation, are almost never used by the individuals who actually share it.[3] The researchers quickly discovered this when QuickCode produced these as exclusion phrases for the initial keyword queries.
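This inclusion/exclusion logic can be pictured as a simple keyword query classifier. The sketch below is a generic illustration, not QuickCode's actual implementation; the phrase lists are drawn from the examples in this article.

```python
# Illustrative phrase sets taken from this article's examples.
# The matching logic is a hypothetical sketch, not QuickCode's code.
INCLUDE_PHRASES = ["experimental vaccine", "zev zelenko"]
EXCLUDE_PHRASES = ["anti-vax", "skeptic"]

def contains_any(text: str, phrases) -> bool:
    """Case-insensitive substring match against a phrase list."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)

def flag_tweet(tweet: str) -> bool:
    """Flag a tweet as candidate misinformation: it must contain at least
    one inclusion phrase and none of the exclusion phrases."""
    return contains_any(tweet, INCLUDE_PHRASES) and not contains_any(tweet, EXCLUDE_PHRASES)

print(flag_tweet("Don't take the experimental vaccine!"))  # True
print(flag_tweet("Anti-vax accounts keep pushing the 'experimental vaccine' line"))  # False
```

The exclusion phrases do the quiet work here: they filter out topically similar posts written by critics of misinformation rather than spreaders of it.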

QuickCode systematically revealed some of the common narratives and slogans in (and absent from) the vaccine misinformation. For example, QuickCode flagged the phrase "don't have to worry" as negatively associated with the misinformation-sharing keyword sets.[4] It turns out that this phrase is used in a recurring meme shared by vaccine supporters in the format of "if you've ever done [something gross], you don't have to worry about what's in the vaccine".

On the other hand, QuickCode helped the researchers identify unsubstantiated narratives about the efficacy of alternative treatments for COVID-19, in particular ivermectin.[5]

QuickCode provides two forms of output: curated text datasets and keyword query classifiers.  After iteratively compiling a keyword set using QuickCode, the researchers deployed their keyword classifier to the Twitter firehose and evaluated the classification accuracy for a sample of 200 users. They found that the QuickCode keyword set produced 95% accuracy, a false positive rate below 10%, and a false negative rate of nearly 0%.
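The evaluation the researchers describe - accuracy, false positive rate, and false negative rate over a labeled sample - takes only a few lines to compute. The helper below is a generic sketch of those standard metrics, not part of QuickCode itself.

```python
def evaluate(predictions, labels):
    """Compute (accuracy, false positive rate, false negative rate) for
    binary predictions, where True = flagged as misinformation."""
    tp = sum(p and y for p, y in zip(predictions, labels))          # correctly flagged
    tn = sum(not p and not y for p, y in zip(predictions, labels))  # correctly passed over
    fp = sum(p and not y for p, y in zip(predictions, labels))      # flagged in error
    fn = sum(not p and y for p, y in zip(predictions, labels))      # missed
    accuracy = (tp + tn) / len(labels)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr

# Hypothetical labeled sample, purely for illustration:
acc, fpr, fnr = evaluate(
    predictions=[True, False, True, False],
    labels=[True, False, False, False],
)
print(acc, fpr, fnr)  # 0.75 0.3333333333333333 0.0
```

Note that accuracy alone can flatter a classifier when misinformation posts are rare, which is why the researchers also report the two error rates separately.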

The noise that exists in Twitter datasets presented a huge challenge for the Harvard researchers. By combining their own expertise with QuickCode’s machine learning recommendations, they were able to narrow their large dataset to a smaller, more manageable subset of highly relevant and representative Tweets, improving the quality of their misinformation research data. 

Identifying misinformation - be it in the form of conspiracy theories, unverified rumors, or unproven scientific claims - is not just a problem for social media platforms, but for any organization that allows an open exchange of ideas.  QuickCode is the place to start.