What you use, not what you do: Automatic classification and similarity detection of recipes

Abstract

Social media data is notoriously noisy and unclean. User-built recipe collections and their manual categorizations are no exception. However, a consistent and transparent categorization is vital to users who search for a specific entry. Curators face a similar challenge when given a large collection of existing recipes: they first need to understand the data before they can build a clean system of categories. This paper presents an empirical study using machine learning classifiers (logistic regression and decision trees) for the automatic classification of recipes on the German cooking website Chefkoch.de. The central question we aim to answer is: Which information is necessary to perform well at this task? In particular, we compare features extracted from the free-text instructions of the recipe to those taken from the list of ingredients. On a sample of 5000 recipes with 87 classes, our feature analysis shows that a combination of nouns from the textual description of the recipe with ingredient features performs best in the logistic regression model (48% F1). Nouns alone achieve 45% F1 and ingredients alone 46% F1. Other word classes do not complement the information from nouns. Decision trees consistently underperform logistic regression but lead to an interpretable model. On a larger training set of 50,000 instances, the best configuration improves to 57%, highlighting the importance of a sizeable data set. In addition, we report on the use of these feature vectors for similarity search and ranking of recipes, and we evaluate them on the task of (near-)duplicate detection. We show that our method can reduce the manual curation effort, achieving precision@3 = 0.52.
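
The similarity search and precision@3 evaluation mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the similarity measure, the feature encoding, or the exact definition of precision@k. The sketch uses cosine similarity over combined noun and ingredient counts, and counts the fraction of the top k results that are known duplicates. The function names, the `noun=`/`ingr=` prefixes, and the toy recipes are hypothetical.

```python
import math
from collections import Counter


def feature_vector(nouns, ingredients):
    """Combine noun and ingredient tokens into one sparse count vector.

    The prefixes keep the two feature families apart (hypothetical convention).
    """
    vec = Counter(f"noun={n.lower()}" for n in nouns)
    vec.update(f"ingr={i.lower()}" for i in ingredients)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(query_id, vectors):
    """Return all other recipe ids, most similar to the query first."""
    query = vectors[query_id]
    scored = [(cosine(query, v), rid) for rid, v in vectors.items() if rid != query_id]
    # Sort by descending similarity; break ties by id for determinism.
    return [rid for _, rid in sorted(scored, key=lambda t: (-t[0], t[1]))]


def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k ranked ids that are in the relevant set."""
    return sum(1 for rid in ranked[:k] if rid in relevant) / k


# Toy data (invented for this sketch, not taken from Chefkoch.de).
recipes = {
    "pancake_a": feature_vector(["pan", "batter"], ["flour", "egg", "milk"]),
    "pancake_b": feature_vector(["pan", "batter"], ["flour", "egg", "milk", "sugar"]),
    "salad": feature_vector(["bowl", "dressing"], ["lettuce", "tomato", "oil"]),
    "omelette": feature_vector(["pan"], ["egg", "milk"]),
}

ranked = rank_similar("pancake_a", recipes)
p_at_3 = precision_at_k(ranked, relevant={"pancake_b"}, k=3)
```

In a curation setting, a ranking like this would let a curator inspect only the first few candidates per recipe instead of the whole collection, which is the kind of effort reduction the abstract describes.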

Keywords: Recipe, Cooking food, Classification, Multi-label, Text mining, Similarity search

Review history: Received 10 November 2017, Revised 3 April 2018, Accepted 4 April 2018, Available online 6 April 2018, Version of Record 13 October 2018.

Official URL: https://doi.org/10.1016/j.datak.2018.04.004