Standardized Data Set Annotations Could Aid in Detecting Social Media Sentiments

Byron Spice | Thursday, August 3, 2023

SCS Ph.D. student Lynnette Hui Xian Ng cautions that the machine learning models used to gauge public attitudes in huge social media data sets have limitations that researchers should be aware of.

Social media provides an important window into the public zeitgeist, generating massive data sets that reflect attitudes on everything from abortion to the latest Taylor Swift concert. Tapping this resource would be impossible without artificial intelligence, but Carnegie Mellon University researchers caution that the machine learning models employed to analyze these data sets, known as stance detection models, have some limitations.

Stance detection models allow AI to identify positive, negative or neutral reactions in large sets of social media posts. However, a model trained using a data set on one topic will have trouble assessing the attitudes present in data on a separate topic, according to research by Lynnette Hui Xian Ng, a Ph.D. student in societal computing in the School of Computer Science's Software and Societal Systems Department. For example, a model trained to detect public sentiments using a Twitter data set on politics will have trouble discerning whether sentiments are positive, negative or neutral in a Twitter data set on COVID-19 vaccines.
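To make the idea concrete, here is a minimal sketch of what such a model does, using a generic bag-of-words classifier rather than Ng's actual system; the example posts and labels are invented for illustration.

```python
# A minimal stance-classifier sketch (not Ng's model): a bag-of-words
# pipeline that labels posts as favor, against or neutral.
# The training posts and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training posts about a single topic (e.g., a policy debate).
posts = [
    "I fully support the new policy, great work",
    "This policy is a disaster and must be repealed",
    "The committee will vote on the policy next week",
]
stances = ["favor", "against", "neutral"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, stances)

print(model.predict(["I really back this proposal"]))  # expected: a favor-like label
```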

Ng explored the generalizability of stance detection by training models that each used one of seven publicly available data sets on topics such as 2016 politicians, atheism, and company mergers and acquisitions. She then tested each model against a data set that was not used to train it. In each case, the accuracy of the stance predictions was poor.
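The evaluation she describes can be sketched as follows: fit a model on one annotated corpus, then score it on a corpus it never saw. The load_dataset helper and the corpus names are hypothetical stand-ins for the seven public data sets.

```python
# A sketch of the cross-data-set test: train on one annotated corpus,
# then measure accuracy on a corpus the model never saw.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def cross_dataset_accuracy(train_set, test_set):
    """Train on one (texts, labels) corpus, score on another."""
    train_texts, train_labels = train_set
    test_texts, test_labels = test_set
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))

# Hypothetical usage, with load_dataset() returning (texts, labels) pairs:
# acc = cross_dataset_accuracy(load_dataset("politics"), load_dataset("vaccines"))
```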

Training the models on multiple data sets improved their performance on unseen data sets, but Ng noted that those models were less likely to detect nuances such as sarcasm. Multi-data-set training would work better still, she said, if the data sets were annotated in the same way, with common labels and definitions.
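The kind of standardization she recommends can be sketched as a label-mapping step run before corpora are merged for training; the per-corpus label names below are invented examples of the inconsistencies such a scheme would iron out.

```python
# A sketch of annotation standardization: map each corpus's home-grown
# stance labels onto one shared scheme before merging for training.
# The corpus and raw label names are invented for illustration.
LABEL_MAP = {
    "politics": {"FAVOR": "favor", "AGAINST": "against", "NONE": "neutral"},
    "vaccines": {"pro": "favor", "anti": "against", "unrelated": "neutral"},
    "mergers":  {"support": "favor", "refute": "against", "comment": "neutral"},
}

def normalize(corpus_name, records):
    """Rewrite (text, raw_label) pairs into the shared favor/against/neutral scheme."""
    mapping = LABEL_MAP[corpus_name]
    return [(text, mapping[raw]) for text, raw in records]

merged = (normalize("politics", [("great policy", "FAVOR")])
          + normalize("vaccines", [("get the shot", "pro")]))
print(merged)  # [('great policy', 'favor'), ('get the shot', 'favor')]
```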

Based on this work, completed with Kathleen Carley, a professor of societal computing, Ng urges researchers to use standardized labels and consistent annotation schemes when they annotate the data sets used to train these machine learning models. The research report received the 2022 Best Ph.D. Paper award from the journal Information Processing and Management.

For More Information

Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu