Maximal Marginal Relevance to Re-rank results in KeyPhrase Extraction

Maximal Marginal Relevance a.k.a. MMR has been introduced in this paper The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. MMR tries to reduce the redundancy of results while at the same time maintaining query relevance of results for already ranked documents/phrases etc.

We first try to understand the scenario by taking an example and will see how MMR is helpful in solving the issue.

Recently I was trying to extract KeyPhrases from a set of documents that belongs to one category. I have used different approaches (TextRank, RAKE, POS tagging, etc.. to name a few) to extract keywords from the documents, which provides phrases along with score. This score is used as the ranking of the phrases for that document.

Let’s say your final keyPhrases are ranked like Good Product, Great Product, Nice Product, Excellent Product, Easy Install, Nice UI, Light weight etc. But there is an issue with this approach, all the phrases like good product, nice product, excellent product are similar and define the same property of the product and are ranked higher. Suppose we have a space to show just 5 keyPhrases, in that case, we don’t want to show all these similar phrases.

You want to properly utilize this limited space such that the information displayed by the Keyphrases about the documents is diverse enough. Similar types of phrases should not dominate the whole space and users can see a variety of information about the document.

KeyPhrase Extraction

We are going to address this problem in this blog post. There might be different approaches to solve this problem. For the sake of simplicity and completeness of the article, I am going to discuss two approaches:

Remove redundant phrases using cosine similarity

To use cosine similarity is the naive approach that came to mind to deal with terms having the same meaning. Use word embeddings to find embeddings of phrases and find cosine similarity between embeddings. Set a threshold above which you will consider the terms as similar. Just take one keyPhrase having more score out of clubbed phrases in the result.

 from sklearn.metrics.pairwise import cosine_similarity
 def club_similar_keywords(emb_mat, sim_score=0.9):
    """
    :param emb_mat: matrix having vectors with words as index
    :param sim_score: 0.9 by default
    :return: returns list of unique words from index after combining words which has similarity score of more than
    0.9
    """
    if len(emb_mat) == 0:
        return 'NA'
    xx = cosine_similarity(emb_mat)
    final_keywords = set(emb_mat.index)
    N = len(emb_mat.index)
    dd = {}
    for i in range(N):
        for j in range(N):
            if (float(xx[i][j]) > sim_score) and (i != j):
                try:
                    dd[emb_mat.index[i]].append(emb_mat.index[j])
                except:
                    dd[emb_mat.index[i]] = []
                    dd[emb_mat.index[i]].append(emb_mat.index[j])
    removed_keywords = []
    for key in dd:
        for val in dd[key]:
            if key not in removed_keywords:
                removed_keywords += dd[key]
                try:
                    final_keywords.remove(val)
                except:
                    pass
    return final_keywords

An issue with this approach is that you need to set the threshold (0.9 in code) above which, terms will be clubbed together. And sometimes very close keywords might have cosine similarity < threshold. Word embeddings have been used to convert the sentence to vector by averaging word tokens. Keeping the threshold low will lead to dealing with the same issue again. I find it difficult to manually tweaking this threshold to include all edge cases.

Re-Ranking the KeyPhrases using MMR

The idea behind using MMR is that it tries to reduce redundancy and increase diversity in the result and is used in text summarization. MMR selects the phrase in the final keyphrases list according to a combined criterion of query relevance and novelty of information.

The latter measures the degree of dissimilarity between the document being considered and previously selected ones already in the ranked list. [1]

MMR ranking provides a useful way to present information to the user that is not redundant. It considers the similarity of keyphrase with the document, along with the similarity of already selected phrases.

$MMR = \operatorname*{Arg\,max}_{D_i \in {R/S}}[\lambda (Sim_1(D_i, Q)- (1-\lambda) {\max}_{D_i \in S} Sim_2(D_i, D_j))]$

where, Q = Query (Description of Document category)
D = Set of documents related to Query Q
S = Subset of documents in R already selected
R\S = set of unselected documents in R
$\lambda$ = Constant in range [0 - 1], for diversification of results

In the below implementation of MMR, cosine similarity has been considered as $Sim_1$ and $Sim_2$.

Any other similarity measure can be taken and the function can be modified accordingly.

from sklearn.metrics.pairwise import cosine_similarity
def maximal_marginal_relevance(sentence_vector, phrases, embedding_matrix, lambda_constant=0.5, threshold_terms=10):
    """
    Return ranked phrases using MMR. Cosine similarity is used as similarity measure.
    :param sentence_vector: Query vector
    :param phrases: list of candidate phrases
    :param embedding_matrix: matrix having index as phrases and values as vector
    :param lambda_constant: 0.5 to balance diversity and accuracy. if lambda_constant is high ,      then higher accuracy. If lambda_constant is low then high diversity.
    :param threshold_terms: number of terms to include in result set
    :return: Ranked phrases with score
    """
    # todo: Use cosine similarity matrix for lookup among phrases instead of making call everytime.
    s = []
    r = sorted(phrases, key=lambda x: x[1], reverse=True)
    r = [i[0] for i in r]
    while len(r) > 0:
        score = 0
        phrase_to_add = ''
        for i in r:
            first_part = cosine_similarity([sentence_vector], [embedding_matrix.loc[i]])[0][0]
            second_part = 0
            for j in s:
                cos_sim = cosine_similarity([embedding_matrix.loc[i]], [embedding_matrix.loc[j[0]]])[0][0]
                if cos_sim > second_part:
                    second_part = cos_sim
            equation_score = lambda_constant*(first_part-(1-lambda_constant) * second_part)
            if equation_score > score:
                score = first_part - (1 - lambda_constant) * second_part
                phrase_to_add = i
        if phrase_to_add == '':
            phrase_to_add = i
        r.remove(phrase_to_add)
        s.append((phrase_to_add, score))
    return (s, s[:threshold_terms])[threshold_terms > len(s)]

Setting $\lambda$ to 0.5 gives the optimal mix of diversity and accuracy in the result set. The value of $\lambda$ can be set based on the use-case and your dataset.

MMR helps to address the issue by ranking similar phrases far away. So the issue to select top N keyPhrase has been resolved as all similar terms are not grouped and don’t appear in the final result.

Please let me know if you like the post, or have some suggestions/concerns.

References:

Machine learning NLP Code

Comments

Related Posts

Ranking the results of ML models with a time decay factor for large-scale Anomaly detection 03 Apr 2023

Monitoring Machine Learning Models in Production 02 Mar 2021

Wilson Lower bound Score and Bayesian Approximation for K star scale rating to Rate products 16 Jan 2020