On early detection of high voted Q&A on Stack Overflow

Authors:

Highlights:

Abstract

Early detection of high quality content on community question answering (CQA) platforms is an important emerging problem whose main goal is to detect high quality questions and answers shortly after their submission. Solving this problem brings several benefits: it improves question routing, reduces the number of unanswered questions, improves the user experience, and promotes the content quality of a CQA platform by rejecting low quality content. The main challenge is that the values of only a few features are available shortly after a post is submitted. In other words, unlike previous related research, it is not possible to use a comprehensive set of features to detect high quality content. In this paper, we view content quality from the perspective of the voting outcome. Specifically, we consider those Q&A posts that will receive more votes than a certain threshold to be high quality posts. Analyzing a large amount of CQA data, we observed two important patterns that help with early detection of high quality content. We name the first pattern the accepted answer effect and the second the answer competition effect. According to the first pattern, a high quality question is more likely to receive an accepted answer than other questions, and vice versa. According to the second pattern, only a few of the answers to a given question will be high quality answers. We show that these patterns hold shortly after content is submitted to a CQA platform. Using these patterns, we propose a unified relational classification framework to solve the problem. In our proposed framework, the quality of a given question and its associated answers can be predicted simultaneously soon after their submission.
We conduct several experiments on six data collections gathered from Stack Overflow to show the efficiency of the proposed models. Our experiments indicate that the performance of high quality content detection improves by up to 10.7% and 35.3% over a state-of-the-art independent classifier for questions and answers, respectively. Moreover, we found average F-measure gains of 1.2% and 11.8% over a recent strong baseline by Yao et al. (2015) for questions and answers, respectively.
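
As a rough illustration of the labeling rule and the evaluation metric mentioned in the abstract, the following minimal sketch marks a post as high quality when its vote count exceeds a threshold and computes the F-measure of a set of predictions against those labels. The names `Post`, `label_high_quality`, and `f_measure`, as well as the threshold value used in the usage note, are assumptions made for illustration; the abstract does not state the threshold, and this sketch does not reproduce the relational classification framework itself.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """A question or answer with its final vote count (hypothetical structure)."""
    post_id: int
    votes: int


def label_high_quality(posts, threshold):
    """Map post_id -> True when the post received more votes than the threshold."""
    return {p.post_id: p.votes > threshold for p in posts}


def f_measure(predicted, actual):
    """F1 score of boolean predictions against boolean ground-truth labels."""
    tp = sum(1 for k in actual if predicted[k] and actual[k])
    fp = sum(1 for k in actual if predicted[k] and not actual[k])
    fn = sum(1 for k in actual if not predicted[k] and actual[k])
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with an assumed threshold of 5, posts with 12 and 7 votes would be labeled high quality while posts with 3 and 0 votes would not; an early prediction can then be scored against these labels with `f_measure`.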

Keywords:

Review history: Received 4 October 2016, Revised 12 January 2017, Accepted 7 February 2017, Available online 10 March 2017, Version of Record 10 March 2017.

Official paper URL: https://doi.org/10.1016/j.ipm.2017.02.005