This post describes a project I completed as an Insight Data Science fellow, a program which helps PhD academics transition to careers in Data Science.
Quora is a website where users from the general public can ask questions and crowdsource answers. Posted questions can have multiple answers, and users upvote answers according to their usefulness. As Quora has grown, so has their problem with duplicate questions. The Quora management has identified question duplicates as a major threat to the quality of their product. As well as being annoying, repeated questions divide useful answers and their upvotes among the duplicates, making it challenging for users to find useful answers. To address this problem, Quora allows duplicate questions to be merged. Even with this feature, the site is still struggling with a proliferation of duplicate questions.
A screenshot of a Quora question asking why there are so many duplicate questions on Quora, which itself has been merged with a duplicate of itself. Meta.
On January 30th, 2017, Quora released a dataset of over 400 thousand question pairs, some of were asking the same underlying question and other pairs which were not. The non-duplicate pairs were supplemented with questions which were asking about the same topic and represent topically related but semantically distinct questions. The question pairs were labelled by human taggers as true duplicate questions or non-duplicates, although this labelling is not perfect. (Update: I started on the dataset the day it was released but since then it has gained an associated Kaggle competition where you can find examples of a variety of approaches.)
Two examples from the dataset, one which is asking the same underlying question and one which is topically related, semantically similar, but syntactically distinct.
For my Insight Data Science project, I used natural language processing (NLP) methods to classify these question pairs as either likely duplicates or distinct questions. This is a challenging NLP problem because of the questions are very short, meaning there are few semantic features to use to determine similarity. Subtle changes can also lead to large differences in meaning. As an example, take the questions, “What do women want from men?” and “What do men want from women?” To a simple bag-of-words (BOW) models, which discard grammar or word order information, these questions look exactly the same. When I applied BOW models to the Quora duplicate problem, this was a common failure mode. This suggested to me that despite the additional complexity and computational investment, the best choice for the problem would be an NLP approach which takes into account syntactic information.
Of the approaches I tested, the one with the highest accuracy is based on skip-thoughts. The reason I was drawn to this model was seeing its success on a task determining semantic similarity between short sentence pairs.
From Kiros et al. 2015. Each sentence pair was evaluated by a human for similarity, which is listed in this figure as GT, or ground truth, scored from 1-5 with 5 being the most similar. Skip thoughts performed favorably on these metrics compared to other state of the art NLP methods.
Skip-thoughts provides a method for encoding sentences as whole as vectors as opposed to ad-hoc sentence encodings created by combining vector embeddings for individual words in a sentence, a common NLP approach. The model I used was trained with a recurrent neural network over a period of several weeks. The input for the unsupervised training was a large corpus of unpublished books containing over 74 million sentences. Resulting sentence vectors are embedded in a 4800 dimensional space, and questions encoded using the model which have similar vector representations should represent semantically and syntactically similar questions.
Visual description of the question comparison process.
I used the skip-thought model to encode each of my questions into a vector representation, and then computed what is called the cosine similarity, which compares the two vector representations accounting for the angle between the two vectors. The output from running this method on a subset of the question pairs shows that pairs with high cosine similarity are more likely to be duplicate questions.
Distributions of both duplicate and non-duplicate question pairs as a function of cosine similarity.
The accuracy of the skip-thought classification as a function of the cosine similarity threshold, and tops out at around 67%. It’s likely that the variation seen in the accuracy for cosine similarity thresholds above is due to noise.
Accuracy of question pair classification for the skip-thought method as a function of the cosine similarity threshold.
The receiver operating characteristic (ROC) curve, which plots the false positive rate against the true positive rate, is commonly used to determine the performance of classification tasks. It shows that my Quora question pair classification does much better than random, which would be represented by the blue dotted line. The area under the ROC curve (known as AUC) is 0.72.
Receiver operating characteristic curve, showing classification is much better than random, with an area under the curve (AUC) of 0.72.
Below a certain threshold in cosine similarity, the algorithm identifies non-duplicate pairs with high confidence, which could eliminate a large population of questions from the pool slated for manual review. For those questions which high cosine similarity, however, the false positive rate is high. This method would be best suited to augment and human duplicate identification and make that process more efficient.
If I had time to invest further work towards improving the accuracy of the method, I would apply linguistic parsers or parts of speech identifiers and see if this could be used to engineer relevant features which provide additional clues for question discrimination or could be used in concert with other machine learning algorithms suited to classification problems.
Appendix: Other approaches
Since this was my first foray into NLP, I initially attacked the problem with simple method known as term-frequency – inverse document frequency (tf-idf). Term frequency refers to the number of times a term occurs in a document, logarithmically scaled. The inverse document frequency is the inverse fraction of documents which contain that term, also logarithmically scaled. This has the effect of down weighting the importance of common terms and up weighting the importance of rare terms when determining the The tf-idf is then calculated by multiplying the term frequency by the inverse document frequency. The resulting matrix represents weighted word frequencies and can be used to compare each question pair.
A bumper sticker for bag of words model enthusiasts, drawing attention to the fact that bag of words models disregard both grammar and word order. Modified from a design by Othar Hansson.
Word2vec from the gensim package is commonly used for python-implemented NLP tasks, and it enables training of models for vector encoding and decoding of words. Word2vec based models can contain impressively deep information about the relationship between words. In an oft-cited example for a popular word2vec model, if vector(“Madrid”) represents the vector for Madrid, then vector(“Madrid”) – vector(“Spain”) + vector(“France”) is closest to vector(“Paris”), despite the model not being explicitly told about country/capitol city relationships.
In addition to skip-thoughts, I also applied several word2vec methods. I trained a custom 300-feature skip-gram word2vec model on all words found in the training questions, which represented 80% of the total set of question pairs. Another approach was to use a word2vec model which was pre-trained on a large corpus of Google News articles, yielding a vocabulary of 3 million words, each encoded by a vector with 300 features. Questions were compared to one another using cosine similarity. A third word2vec method I tried used the same Google News word2vec pre-trained model but compared the question pairs using word movers distance, which is an extension of the earth movers distance problem. In each case I removed common words called stopwords.
I tried several approaches over the course of the project, some of which came in slightly below the performance of skip-thoughts. The approaches had AUCs of 0.63, 0.69, 0.70, and 0.70, for tf-idf, word2vec self-trained model, word2vec Google News pre-trained model, and word2vec pre-trained model using word movers distance, respectively.
I would like to thank my fellow Insight Data Science fellows and coordinators for their invaluable feedback. I would especially like to thank Andrej Ficnar and Dan Vatterott for their useful suggestions.