Multi-label classification and knowledge extraction from oncology-related content on online social networks
Authors: Mahdi Hashemi, Margeret Hall
Abstract
This study aims to automatically process and extract knowledge from large amounts of oncology-related content on online social networks (OSN). To this end, a large number of textual OSN posts concerning major cancer types are automatically scraped and structured using natural language processing techniques. Machines are trained to assign multiple labels to each post based on the type of knowledge it contains, if any. The trained machines are then used to automatically classify textual posts at scale, and statistical inferences are drawn from these predictions to extract general concepts and abstract knowledge. Different approaches to constructing document feature vectors had no tangible effect on classification accuracy. Among the classifiers compared, logistic regression achieved the highest overall accuracy (96.4%) and \(\overline{F1}\) (73.4) in a 13-way multi-label classification of textual posts. The most common topic was seeking or providing moral support for cancer patients, followed by providing technical information about cancer causes and treatments. The most common causes and treatments of different types of cancer mentioned on OSN are also automatically detected in this study. Seeking or providing moral support for cancer patients shared the largest overlap with other topics, i.e. moral support tends to be present even in OSN posts that focus on other topics. In contrast, providing technical information about cancer diagnosis or prevention were the most isolated topics: OSN posts on these topics tend not to allude to other topics. OSN posts that seek financial support overlap, if at all, only with the moral support topic. Our methodology and results give public health professionals an opportunity to monitor which topics are being discussed on OSN and to what extent, to see what specific information and knowledge are being disseminated over OSN, and to assess its veracity in close to real time. This helps them develop policies that encourage, discourage, or modify the consumption of viral oncology-related information on OSN.
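The topic-overlap statistics described in the abstract can be illustrated with a minimal sketch. It assumes each post has already been given a set of predicted labels; the label names, the toy posts, and the helper functions `label_overlap` and `co_labelled_fraction` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def label_overlap(predictions):
    """Count posts per label and posts per co-occurring label pair."""
    label_counts = Counter()
    pair_counts = Counter()
    for labels in predictions:
        unique = sorted(set(labels))  # sort so each pair has one canonical order
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return label_counts, pair_counts


def co_labelled_fraction(predictions, label):
    """Fraction of posts carrying `label` that also carry at least one other label."""
    with_label = [set(p) for p in predictions if label in p]
    if not with_label:
        return 0.0
    return sum(1 for s in with_label if len(s) > 1) / len(with_label)


# Hypothetical predicted label sets for six posts (illustrative only).
posts = [
    {"moral_support"},
    {"moral_support", "treatment_info"},
    {"moral_support", "financial_support"},
    {"diagnosis_info"},
    {"treatment_info", "cause_info"},
    set(),  # a post with no label assigned
]

label_counts, pair_counts = label_overlap(posts)
most_common_topic = label_counts.most_common(1)[0][0]
moral_overlap = co_labelled_fraction(posts, "moral_support")
diagnosis_overlap = co_labelled_fraction(posts, "diagnosis_info")
```

In this toy data, a label with a high co-labelled fraction plays the role of an overlapping topic, while a label with a fraction of zero plays the role of an isolated one.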
Keywords: Cancer, Social networks, Natural language processing, Machine learning, Classification, Knowledge extraction
Paper URL: https://doi.org/10.1007/s10462-020-09839-0