Aditya Kumar

Ranking the results of ML models with a time decay factor for large-scale Anomaly detection

2023-04-03T00:00:00+00:00

In this blog post, I will walk through a problem I recently faced that keeps recurring in one form or another: ranking the results produced by machine learning models in the absence of implicit, explicit, or delayed feedback. I will explain the background and the problem, discuss the approaches that can be used to solve it, and finally show how the solution can be implemented in any database and scheduled as a task.

Background

I was recently working on a problem that involved detecting anomalies in VPC flow logs. VPC flow logs are the event logs of communication between two instances, or more specifically between two IP addresses, in a virtual private cloud. They capture different aspects of the communication, such as time, duration, bytes transferred, packets transferred, protocol used, port used, and many more. For any cloud-native organization, the volume of data generated by VPC flow logs can be very large; in my case I was getting ~12–15 billion records per hour, which is a very high arrival rate. The large volume and high velocity of the data add complexity to the problem. The task is to detect anomalies in the data based on specific attributes of the communication. These anomalies help in finding vulnerabilities in the cloud infrastructure and therefore in securing the infrastructure of the organization. The identified anomalies are then reviewed by the incident response team, who separate false positive alerts from genuine threats.

Problem Description

An ML model is trained on the incoming data every hour; it flags a fraction of these logs as anomalies and writes them to a table. Because of the large volume of data, even a very small fraction of anomalies per hour can be too many for the consumer team to investigate. Every day these anomalies are sent to the team for further investigation, and the team provides feedback to us. This feedback needs to be incorporated into model training and prediction.

This blog post focuses on ranking the results of ML models, which helps the downstream team prioritize the investigation of the anomalies.

The current process looks like this.

Properties of the Ranking function

We want the ranking function to have the following properties. They are tailored to my use case but can easily be modified based on the problem and needs.

A score between 0 and 1. A score closer to 1 means a higher priority, and vice versa.
More weight for anomalies that occur multiple times within a time frame.
Scale the score of an anomaly if the ML model has detected it before. Recently detected anomalies need to be weighted more than anomalies detected earlier.
Consider the consumer team’s feedback on already checked anomalies and incorporate it into the ML system.

Ranking Function

We will consider the ranking function in two parts:

Baseline Ranking function: The baseline ranking function provides a score for the detected anomalies. There can be multiple approaches to it, but we will consider frequency-based approaches in this blog post to keep it simple and complete.
Weighing Function for historical detections: The weighing function is a parameter that includes the historical score at a discounted rate.

Baseline Ranking function

Step function: The idea behind this approach is to use the frequency of occurrence of an anomaly as the ranking measure, i.e. if one anomaly appears multiple times in the result, it should be ranked higher than any anomaly that appears fewer times. The score needs to be in the range [0, 1], so we divide the frequency of the anomaly by max_value to bring the score into the desired range. Ranking_anomaly = F(frequency_anomaly, max_value)
Sigmoid Function

The sigmoid function is widely used in ML and its range is already (0, 1). Some of its properties need to be adapted to our needs:

𝜎(x) → 0 for x ≤ -4, but we want the score to be close to 0 when x → 0, so the function needs to be shifted to the right by 4 units.
𝜎(x) → 1 for x ≥ 3, i.e. any value of x greater than 3 gives a score close to 1. We want a score close to 1 once we have, let’s say, at least 10 detections, so the input needs to be scaled by 0.7.

Also, in our case an anomaly detected 10 times is almost as important as one detected 24 times in a day. The score for the latter is still greater than the former, although the difference is very small, and that is precisely what we want.

Now the equation becomes

and the graph looks like this

Both of the above approaches work fine for most cases, but neither accounts for detections made by the ML model more than 1 day ago. To consider anomalies that were detected previously, i.e. not on the same day, we can define weights for the historical detections. We want to give more weight to anomalies that have been detected recently.
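The two baseline functions can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: capping the step score at 1 is my addition, and the sigmoid is assumed to combine the shift of 4 and the scale of 0.7 as 𝜎(0.7·x − 4), which is one reading of the two adjustments described above. The function and parameter names are hypothetical.

```python
import math

def step_score(frequency, max_value):
    """Frequency-based score: frequency divided by max_value.

    Capping at 1.0 is an assumption, so that frequencies above max_value stay in range.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return min(frequency / max_value, 1.0)

def sigmoid_score(frequency, scale=0.7, shift=4.0):
    """Shifted and scaled sigmoid; the form sigma(scale * x - shift) is an assumption."""
    return 1.0 / (1.0 + math.exp(-(scale * frequency - shift)))

# Compare both scores for a few daily detection counts.
for freq in (0, 1, 10, 24):
    print(freq, round(step_score(freq, max_value=24), 3), round(sigmoid_score(freq), 3))
```

With these assumed parameters, 10 detections already score close to 1 under the sigmoid, while the step function grows linearly until max_value.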

Weighing Function for historical detections

The idea here is to use a function that accounts for the scores of historical detections. We will use 𝛼 to denote this weight; in this case the whole equation will become something like this

Constant value for discounted historical score: Any value in the range [0, 1) will discount the score. This value can be adjusted based on the requirements and use case. There are different variations: one is to have a separate weight for each day, e.g. 0.8 for a 1-day difference, 0.3 for a 2-day difference, etc. Another is to give equal weight to each day. This can be easily implemented by maintaining a data structure that keeps the weight for each day and multiplying the score by this weight.
Time decay factor: This function gives more weight to recently detected anomalies and less weight to anomalies detected earlier.

How long will it take to diminish the effect of an entry in the result?

This value of 𝛼 discounts scores at a rate of 0.5, i.e. for every day that passes the score is halved. Small scores will vanish very quickly, while larger scores will take a few days to come close to 0.
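Both flavours of the weighing function are easy to sketch. The code below is a sketch, not a definitive implementation: the per-day weights reuse the example values 0.8 and 0.3 from above, the decay is assumed to be exponential with 𝛼 = 0.5 per day, and the cut-off eps below which a score counts as vanished is a hypothetical parameter.

```python
# Hypothetical per-day weights for the "constant value" variant (example values from the text).
DAY_WEIGHTS = {1: 0.8, 2: 0.3}

def constant_weight(days_ago, weights=DAY_WEIGHTS):
    """Weight for a detection made `days_ago` days back.

    Today's detections keep full weight; days without a configured weight get 0.0.
    """
    if days_ago == 0:
        return 1.0
    return weights.get(days_ago, 0.0)

def time_decay_weight(days_ago, alpha=0.5):
    """Exponential decay: the weight is multiplied by alpha for every day passed."""
    return alpha ** days_ago

def days_until_negligible(score, alpha=0.5, eps=0.01):
    """Smallest whole number of days after which score * alpha**days drops below eps."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    days = 0
    while score * alpha ** days >= eps:
        days += 1
    return days

print(days_until_negligible(0.9), days_until_negligible(0.1))
```

A high score takes a few more days to fade than a low one, which matches the behaviour described above.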

Implementation

This can be easily implemented in any database using SQL queries and scheduled as tasks. Detected anomalies are stored in a table with a timestamp, which helps in tracking how many anomalies are generated in a particular hour or day. So finally the problem comes down to “how do we write the ranking function explained above?”.

The idea is to bring the previously ranked anomalies into one intermediate table and add the score of the current day to the historical score by joining the tables. The historical score can be adjusted based on the flavor of the ranking function you want to use. Visually, it can be understood from the figure below.

Each run can be scheduled as a task at a regular cadence based on the requirement. The consumer team can start looking at the anomalies and investigate them based on the updated score from the historical data. Feedback from the team can be incorporated into the system directly via the ranking function, or it can be used to filter data while training the ML model or at inference time. These decisions can be made based on need and use case.
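The join-and-sum step can be sketched with Python’s built-in sqlite3 module, so that it runs without any database setup. The table names (anomaly_scores, day_weights), the column names, and the sample rows are all hypothetical, and capping the final score at 1.0 is an assumption made to keep it within the 0 to 1 range; in a warehouse the same query would run as a scheduled task.

```python
import sqlite3

def rank_with_history(conn, today):
    """Return (anomaly_id, final_score) rows, highest score first."""
    query = """
        SELECT a.anomaly_id,
               MIN(1.0, SUM(a.score * w.weight)) AS final_score
        FROM anomaly_scores AS a
        JOIN day_weights AS w
          ON w.day_diff = CAST(ROUND(julianday(?) - julianday(a.detected_on)) AS INTEGER)
        GROUP BY a.anomaly_id
        ORDER BY final_score DESC
    """
    return conn.execute(query, (today,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anomaly_scores (anomaly_id TEXT, detected_on TEXT, score REAL)")
conn.execute("CREATE TABLE day_weights (day_diff INTEGER, weight REAL)")
conn.executemany(
    "INSERT INTO anomaly_scores VALUES (?, ?, ?)",
    [
        ("a", "2023-04-03", 0.4),  # detected today
        ("a", "2023-04-02", 0.8),  # and also yesterday
        ("b", "2023-04-03", 0.6),  # detected only today
        ("c", "2023-04-01", 0.9),  # detected only two days ago
    ],
)
# Weight per day difference; detections older than 2 days drop out of the join.
conn.executemany("INSERT INTO day_weights VALUES (?, ?)", [(0, 1.0), (1, 0.5), (2, 0.25)])

ranked = rank_with_history(conn, "2023-04-03")
print(ranked)
```

Keeping the weights in their own table means the flavor of the ranking function can be changed by updating rows rather than rewriting the query.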

Please let me know if you like the post, or have some suggestions/concerns and feel free to reach out to me on LinkedIn.

References:

https://docs.snowflake.com/en/user-guide/tasks-intro

Monitoring Machine Learning Models in Production

2021-03-02T00:00:00+00:00

After deploying many ML models in production, it became evident that there should be an easy and efficient way to monitor them after deployment. This blog post focuses on monitoring classification models in production.

Recently, I was working on a text classification problem: classifying text into one of ~50 categories. Once the model is built and tested, it needs to be deployed as a Flask API along with other models. Some text classification models are already deployed as an API that uses Python Flask to serve incoming requests, with Gunicorn as the WSGI server. The API is deployed on Kubernetes clusters and the trained models are stored in S3. The current architecture looks something like this, and a newly trained model needs to be deployed in this kind of setup.

To prepare the training dataset, we started with keywords related to the categories to tag the data, generated new keywords from existing categories, tagged data manually, and so on. We handled cases where a text might fall into any of several closely related categories, and we also used already deployed classifiers to tag data for the categories on which they were trained. Since the training data is in the range of a few million entries, we did random manual checks on the labels to ensure that the entries tagged using the above methods were good enough.

Then a model is trained on the training data and tested on a held-out set to measure the performance of the classifier. If the performance on the test data is acceptable, the model is deployed in production as an API. Now that the model is deployed, why is there a need to monitor it?

Need for monitoring

You have already put the classification model in production, which performed very well on your test dataset. But,

How do you know that the model is performing well on the new data?
How do you know it is time to retrain the model?
How do you know the effect of data drift and concept drift on the model?

One possible way to measure the performance of the classifier is to take all incoming classification requests for a certain duration, manually label the data, and compare the labels with the model predictions. This exercise has to be repeated periodically to gauge the performance of the model. It is the same thing we did while preparing the training data. This process seems unintuitive to me and requires a lot of manual effort, and periodic monitoring of this kind becomes very cumbersome when you have many classification models running in production.

There is a need for something simple, quick, and yet intuitive that gives an idea of how models are performing in production. Also, in the current setup, the way models are deployed and used by downstream applications is pretty stable, so I did not want to invest too much time in evaluating the available ML tools that provide this functionality out of the box.

Prediction Distribution of Model

After some research, a dashboard that displays a plot of the prediction distribution of the incoming requests seemed very intuitive to me. It also answers some of the questions, like:

How is the model performing on each category?
Does the prediction distribution follow a similar pattern to the training data?
Is the model biased towards any category, i.e. is the model predicting some class very often?
Is the model failing to predict any category?
Is there a need to retrain the model?
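To make these questions concrete, here is a minimal sketch of how the prediction distribution could be tallied and compared with the training distribution. It assumes that every served request is logged as a (model_version, predicted_class) pair; that log format, the function names, and the gap measure are assumptions for illustration, not part of the deployed system.

```python
from collections import Counter, defaultdict

def prediction_distribution(prediction_log):
    """Count predicted classes per model version from (model_version, predicted_class) pairs."""
    counts = defaultdict(Counter)
    for model_version, predicted_class in prediction_log:
        counts[model_version][predicted_class] += 1
    return {version: dict(counter) for version, counter in counts.items()}

def largest_share_gap(predicted_counts, training_counts):
    """Largest absolute difference in class share between predictions and training data."""
    pred_total = sum(predicted_counts.values())
    train_total = sum(training_counts.values())
    if pred_total == 0 or train_total == 0:
        raise ValueError("both distributions need at least one observation")
    classes = set(predicted_counts) | set(training_counts)
    return max(
        abs(predicted_counts.get(c, 0) / pred_total - training_counts.get(c, 0) / train_total)
        for c in classes
    )

log = [("1.0", "Class 1"), ("1.0", "Class 2"), ("1.0", "Class 1"), ("2.0", "Category 1")]
dist = prediction_distribution(log)
gap = largest_share_gap(dist["1.0"], {"Class 1": 50, "Class 2": 50})
print(dist, round(gap, 3))
```

The nested dictionary has one entry per model version with class counts inside, so it can be plotted per model; a large gap for some class is a hint that the model is biased towards it or that the incoming data has drifted.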

What else can be tracked?

A new model is trained to classify text into one of 52 categories and uses the BERT-base-cased model. To deploy it in production and staging, we had to increase the resources significantly compared with the previous models so that the model can run smoothly on CPU.

Generally, when it comes to deployment, there are two environments, Staging/UAT and Prod, and there is a significant difference between them in terms of the resources allocated to the application, such as memory and CPU time. The idea is to allocate more resources to the application in production so that it can serve its purpose without any issue. In our case, the number of workers running in production is 2x that of the staging environment, so the resources needed are also doubled. We therefore want to know: do we really need the increased resources in production?

That’s why we want to track the number of API calls, which will eventually answer a few questions like

Is there any need to increase or decrease the resources in production?
Can the whole system be deployed as batch inference in case the number of API calls is low?
Are previously deployed models still being used by downstream applications and, if so, how frequently?

There might be more metrics we could measure, like the responsiveness of the APIs. But here, the main focus was to know how the model was performing in predicting the categories and to keep the effort of tracking these metrics very low.

Here is sample code for generating the divs for the bar plots.

import plotly
import plotly.express as px

def generate_div(prediction_distribution):
    """
    function to generate div html tags from model prediction distribution dictionary.
    :param prediction_distribution: dictionary with keys as model name and its values as a dictionary having 
    its classes and values. It should look like:
    {'1.0': {'Class 1': 23,
         'Class 2': 19,
         'Class 3': 40},
    '2.0': {'Category 1': 10,
         'Category 2': 42,
         'Category 3': 23,
         'Category 4': 20,
         },
    '3.0': {'Class A': 10,
         'Class B': 23,
         'Class C': 12,
         }}
    :type prediction_distribution: Dictionary
    :return: html div tags
    :rtype: list of div tags
    """
    divs = []
    for version in prediction_distribution:
        the_dict = {'Intent_categories':[], 'Values':[]}
        the_dict['Intent_categories'] = list(prediction_distribution[version].keys())
        the_dict['Values'] = [prediction_distribution[version][i] for i in the_dict['Intent_categories']]
        fig = px.bar(the_dict, x='Intent_categories', y='Values', color='Values',title="Class prediction distribution for model %s"%version)
        fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=45)
        divs.append(plotly.io.to_html(fig, include_plotlyjs=False, full_html=False))
    return divs

The HTML template below can be rendered from Flask by passing in the HTML divs generated by the code above.

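from flask import Flask, render_template
from markupsafe import Markup
app = Flask(__name__)

@app.route("/monitoring", methods=['GET'])
def monitor():
    divs = generate_div(predict_dist)  # predict_dist: dictionary in the format generate_div expects
    return render_template('monitor.html', div1=Markup(divs[0]), div2=Markup(divs[1]), div3=Markup(divs[2]))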

   
(monitor.html: a page titled “Model Monitoring” that embeds div1, div2, and div3, styled with width:100%.)

Please let me know if you like the post, or have some suggestions/concerns and feel free to reach out to me on LinkedIn.

References:

https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
https://mlinproduction.com/value-propositions-ml-monitoring-system/
https://www.explorium.ai/blog/understanding-and-handling-data-and-concept-drift/

Wilson Lower bound Score and Bayesian Approximation for K star scale rating to Rate products

2020-01-16T00:00:00+00:00

As the maintainer of an online community with a lot of products, where users rate products based on their experience, at some point you will have to answer questions like

How are you going to order products on the page based on filters such as highest voted or lowest voted? Or
How can you rate a product based on upvotes and downvotes?
How can you give a score to a product that users rate on a K-star scale?

There are some ways you can find a score and rate products accordingly:

Score = Average rating of products
Score = Positive rating - Negative rating
Score = Proportion of Positive ratings

Evan Miller’s famous blog post How Not To Sort By Average Rating explains why scores such as the difference between positive and negative ratings, or the proportion of positive ratings, are not good ways to rate or sort products.

The lower bound of the Wilson score confidence interval for a Bernoulli parameter provides a way to sort products based on positive and negative ratings.

The idea is to treat the existing set of user ratings as a statistical sample from a hypothetical set of ratings from all users, and then use this score. In other words: given the ratings we have for a product from a sample (a subset of the whole community), what fraction of the whole community would upvote the product, with 95% confidence?

Therefore, if we know what a sample of the population thinks, i.e. the user reviews for a product, we can use this to estimate the preferences of the whole community.

Suppose there are X positive votes and Y negative votes for a product and we want to understand how popular the product will be across the whole community. Using the Wilson score confidence interval, we can estimate with 95% confidence that between wilson_lower_bound_score and wilson_upper_bound_score of users will upvote this product.

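Wilson Score

$$\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p}) + \frac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \frac{z_{\alpha/2}^2}{n}}$$

where, $\hat{p}$ = (# of positive ratings)/(Total ratings)
$n$ = Total ratings
$z_{\alpha/2}$ = the $(1-\alpha/2)$ quantile of the standard normal distribution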

import math
import scipy.stats as st

def wilson_lower_bound(pos, n, confidence=0.95):
    """
    Function to provide lower bound of wilson score
    :param pos: No of positive ratings
    :param n: Total number of ratings
    :param confidence: Confidence interval, by default is 95 %
    :return: Wilson Lower bound score
    """
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * pos / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

The Wilson confidence interval assumes a binomial distribution for the score calculation, i.e. it considers only positive and negative ratings. If your product is rated on a 5-star scale, you can treat ratings {1, 2, 3} as negative and {4, 5} as positive and then calculate the Wilson score.
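The conversion from star counts to positive and negative ratings, together with both ends of the interval, can be sketched using only the standard library. This is a sketch under assumptions: it relies on statistics.NormalDist (Python 3.8+) instead of scipy, the helper names and the positive_from cut-off are hypothetical, and returning (0.0, 0.0) for an unrated product simply mirrors the convention of the function above.

```python
import math
from statistics import NormalDist

def stars_to_binary(star_counts, positive_from=4):
    """Collapse star counts [n_1, ..., n_K] into (positive ratings, total ratings)."""
    pos = sum(star_counts[positive_from - 1:])
    return pos, sum(star_counts)

def wilson_interval(pos, n, confidence=0.95):
    """Lower and upper bound of the Wilson score interval; (0.0, 0.0) when there are no ratings."""
    if n == 0:
        return 0.0, 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = pos / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (centre - margin) / denom, (centre + margin) / denom

# Uniform ratings: 10 votes for each of the 5 star levels.
pos, n = stars_to_binary([10, 10, 10, 10, 10])
lower, upper = wilson_interval(pos, n)
print(pos, n, round(lower, 4), round(upper, 4))
```

The lower value is the one used for sorting; the upper value is shown only to make the interval interpretation explicit.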

Let’s look at some examples:

If a product is rated uniformly across each category, [10, 10, 10, 10, 10], i.e. 10 votes for each rating in {1-5}, then wilson_lower_bound(20,50,.95), avg_rating([10, 10, 10, 10, 10]) => (0.2760838973025655, 3.0)
One product receives only one rating, which is positive, and another product receives 10 positive and 2 negative ratings. The product with more ratings should get the greater value: wilson_lower_bound(1,1,0.95) < wilson_lower_bound(10,12,0.95), which is true.
Products with ratings A: (209 up and 50 down votes) and B: (118 up and 25 down votes): wilson_lower_bound(209,259,0.95) < wilson_lower_bound(118,143,0.95)
Suppose one product receives [5, 10, 20, 0, 0] ratings; then wilson_lower_bound(0,35,0.95) = 0. If a product does not have any positive ratings, its Wilson score is zero.
The Wilson score cannot be applied to a new product that is yet to receive any rating; the above implementation returns wilson_lower_bound(0,0,0.95) = 0.

The Wilson score gives zero both to a product that has not received any positive rating and to a new product that is yet to receive any rating. This does not make much sense, as it implies that an unrated product is the same as a product with only low ratings. Also, it is not clear how tight the lower bound is, i.e. how far it deviates from the “real” proportion of thumb-ups [1]. Finally, it does not seem intuitive to convert items rated on a five-star scale into upvotes and downvotes just so that the scores follow a binomial distribution.

Bayesian Approximation

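Bayesian Approximation provides a way to score products when they are rated on a K-star scale.

$$S(n_1, \dots, n_K) = \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K} - z_{\alpha/2} \sqrt{\frac{\left(\sum_{k=1}^{K} s_k^2 \frac{n_k + 1}{N + K}\right) - \left(\sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K}\right)^2}{N + K + 1}}$$

where, $s_k=k$ (That is, 1 point, 2 points, ….)
$N$ = total ratings, with $n_k$ ratings for $k^{th}$ scale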
The above expression provides the lower bound of a normal approximation to a Bayesian credible interval for the average rating. For more mathematical details please check [4].

import math
import scipy.stats as st

def bayesian_rating_products(n, confidence=0.95):
    """
    Function to calculate the Bayesian approximation score for a K-star rating system.
    :param n: Array having count of star ratings where the ith index represents the votes for that category i.e. [3, 5, 6, 7, 10]
    here, there are 3 votes for 1-star rating, similarly 5 votes for 2-star rating.
    :param confidence: Confidence level, by default 95 %
    :return: Score
    """
    if sum(n) == 0:
        return 0
    K = len(n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    first_part = 0.0   # expected rating under the smoothed distribution
    second_part = 0.0  # expected squared rating under the smoothed distribution
    for k, n_k in enumerate(n):
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score

Let’s look at some test samples

bayesian_rating_products([0, 0, 0, 0, 1]) = 2.2290
bayesian_rating_products([0, 2, 0, 10, 0]) = 2.9921
bayesian_rating_products([5, 10, 20, 0, 0]) = 2.2349
bayesian_rating_products([10, 10, 10, 10, 10]) = 2.6296

Comparing samples 1 and 2, we can see that the second product should have a higher score, and that is the case here. Also, for sample 3, unlike the Wilson score, it gives a score to a product that does not have any positive rating, and that score is still greater than that of the first product, which seems reasonable to me.

Unlike the Wilson score, Bayesian Approximation does not consider only upvotes and downvotes but uses the ratings across the whole K-star scale, and proves to be better in this scenario.
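To see how the score would be used for sorting, here is a short sketch that ranks the four test samples. It re-implements the same formula with statistics.NormalDist from the standard library so that it runs without scipy; the product names A to D are hypothetical labels for the four samples.

```python
import math
from statistics import NormalDist

def bayesian_score(star_counts, confidence=0.95):
    """Lower bound of the normal approximation to the credible interval of the average rating."""
    total, levels = sum(star_counts), len(star_counts)
    if total == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    mean = sum((k + 1) * (n_k + 1) for k, n_k in enumerate(star_counts)) / (total + levels)
    mean_sq = sum((k + 1) ** 2 * (n_k + 1) for k, n_k in enumerate(star_counts)) / (total + levels)
    return mean - z * math.sqrt((mean_sq - mean * mean) / (total + levels + 1))

# Hypothetical product names mapped to the star counts of the test samples.
products = {
    "A": [0, 0, 0, 0, 1],
    "B": [0, 2, 0, 10, 0],
    "C": [5, 10, 20, 0, 0],
    "D": [10, 10, 10, 10, 10],
}
ranking = sorted(products, key=lambda name: bayesian_score(products[name]), reverse=True)
print(ranking)
```

The product with many 4-star ratings comes first, and the product with a single 5-star rating comes last because one rating gives very little evidence.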

Please let me know if you like the post, or have some suggestions/concerns in comments and feel free to reach out to me on LinkedIn.

References

[1] How to Count Thumb-Ups and Thumb-Downs: User-Rating based Ranking of Items from an Axiomatic Perspective
[2] Star Quality: Aggregating Reviews to Rank Products and Merchants
[3] http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
[4] http://www.evanmiller.org/ranking-items-with-star-ratings.html

Maximal Marginal Relevance to Re-rank results in KeyPhrase Extraction

2019-10-13T00:00:00+00:00

Maximal Marginal Relevance, a.k.a. MMR, was introduced in the paper The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. MMR tries to reduce the redundancy of results while maintaining their relevance to the query, for already ranked documents, phrases, etc.

Let’s first understand the scenario through an example and then see how MMR helps in solving the issue.

Recently I was trying to extract keyphrases from a set of documents that belong to one category. I used different approaches (TextRank, RAKE, POS tagging, to name a few) to extract keywords from the documents, each of which provides phrases along with a score. This score is used to rank the phrases for that document.

Let’s say your final keyPhrases are ranked like Good Product, Great Product, Nice Product, Excellent Product, Easy Install, Nice UI, Light weight etc. The issue with this ranking is that phrases like good product, nice product, and excellent product are similar, describe the same property of the product, and are all ranked near the top. Suppose we have space to show just 5 keyPhrases; in that case, we don’t want to show all of these similar phrases.

You want to use this limited space so that the information the keyphrases convey about the documents is diverse enough. Similar phrases should not dominate the whole space, and users should see a variety of information about the document.

We are going to address this problem in this blog post. There might be different approaches to solve this problem. For the sake of simplicity and completeness of the article, I am going to discuss two approaches:

Remove redundant phrases using cosine similarity

Using cosine similarity is the naive approach that came to mind for dealing with terms that have the same meaning. Use word embeddings to compute embeddings of the phrases and find the cosine similarity between them. Set a threshold above which you consider terms similar, and keep just one keyPhrase, the one with the higher score, out of each group of clubbed phrases.

from sklearn.metrics.pairwise import cosine_similarity

def club_similar_keywords(emb_mat, sim_score=0.9):
    """
    :param emb_mat: matrix having vectors with words as index
    :param sim_score: similarity threshold, 0.9 by default
    :return: set of unique words from the index after combining words whose similarity score is more than
    sim_score
    """
    if len(emb_mat) == 0:
        return 'NA'
    xx = cosine_similarity(emb_mat)
    words = list(emb_mat.index)
    final_keywords = set(words)
    N = len(words)
    # Map every word to the words whose similarity with it exceeds the threshold.
    dd = {}
    for i in range(N):
        for j in range(N):
            if (float(xx[i][j]) > sim_score) and (i != j):
                dd.setdefault(words[i], []).append(words[j])
    removed_keywords = []
    for key, similar_words in dd.items():
        # A word already clubbed into an earlier group does not start a group of its own.
        if key in removed_keywords:
            continue
        removed_keywords += similar_words
        for val in similar_words:
            final_keywords.discard(val)
    return final_keywords

An issue with this approach is that you need to set the threshold (0.9 in the code) above which terms are clubbed together, and sometimes very close keywords have a cosine similarity below the threshold. Here, word embeddings were used to convert a phrase to a vector by averaging the vectors of its word tokens. Keeping the threshold low leads to the same kind of issue again. I found it difficult to tweak this threshold manually to cover all edge cases.
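The averaging step and the similarity check can be illustrated with a tiny, dependency-free sketch. The 3-dimensional word vectors below are invented purely for illustration; real embeddings have hundreds of dimensions and come from a pre-trained model.

```python
import math

# Toy word vectors, invented for illustration only.
WORD_VECTORS = {
    "good": [1.0, 0.0, 0.0],
    "nice": [0.9, 0.1, 0.0],
    "product": [0.0, 1.0, 0.0],
    "easy": [0.0, 0.0, 1.0],
    "install": [0.0, 0.2, 1.0],
}

def phrase_vector(phrase, word_vectors=WORD_VECTORS):
    """Average the vectors of the words in the phrase, dimension by dimension."""
    vectors = [word_vectors[word] for word in phrase.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

similar = cosine(phrase_vector("Good Product"), phrase_vector("Nice Product"))
different = cosine(phrase_vector("Good Product"), phrase_vector("Easy Install"))
print(round(similar, 3), round(different, 3))
```

In this toy setup the two "product" phrases land above the 0.9 threshold and would be clubbed, while the unrelated phrase stays far below it. With real embeddings the gap is rarely this clean, which is where the threshold tuning gets hard.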

Re-Ranking the KeyPhrases using MMR

The idea behind MMR is that it tries to reduce redundancy and increase diversity in the result; it is also used in text summarization. MMR selects the phrases for the final keyphrase list according to a combined criterion of query relevance and novelty of information.

The latter measures the degree of dissimilarity between the document being considered and previously selected ones already in the ranked list. [1]

MMR ranking provides a useful way to present non-redundant information to the user. It considers the similarity of a keyphrase to the document, along with its similarity to the already selected phrases.

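$$MMR = \arg\max_{D_i \in R \setminus S} \left[ \lambda \, Sim_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} Sim_2(D_i, D_j) \right]$$

where, Q = Query (Description of Document category)
R = Set of documents related to Query Q
S = Subset of documents in R already selected
R\S = Set of unselected documents in R
$\lambda$ = Constant in range [0, 1], for diversification of results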

In the below implementation of MMR, cosine similarity has been considered as $Sim_1$ and $Sim_2$.

Any other similarity measure can be taken and the function can be modified accordingly.

import math
from sklearn.metrics.pairwise import cosine_similarity

def maximal_marginal_relevance(sentence_vector, phrases, embedding_matrix, lambda_constant=0.5, threshold_terms=10):
    """
    Return ranked phrases using MMR. Cosine similarity is used as similarity measure.
    :param sentence_vector: Query vector
    :param phrases: list of candidate (phrase, score) tuples
    :param embedding_matrix: matrix having index as phrases and values as vector
    :param lambda_constant: 0.5 to balance diversity and accuracy. If lambda_constant is high, then higher accuracy. If lambda_constant is low, then higher diversity.
    :param threshold_terms: number of terms to include in result set
    :return: Ranked phrases with score
    """
    # todo: Use cosine similarity matrix for lookup among phrases instead of making call everytime.
    s = []
    r = sorted(phrases, key=lambda x: x[1], reverse=True)
    r = [i[0] for i in r]
    while len(r) > 0:
        score = -math.inf
        phrase_to_add = None
        for i in r:
            # Relevance of the candidate phrase to the query.
            first_part = cosine_similarity([sentence_vector], [embedding_matrix.loc[i]])[0][0]
            # Highest similarity of the candidate to any phrase that is already selected.
            second_part = 0
            for j in s:
                cos_sim = cosine_similarity([embedding_matrix.loc[i]], [embedding_matrix.loc[j[0]]])[0][0]
                if cos_sim > second_part:
                    second_part = cos_sim
            equation_score = lambda_constant * first_part - (1 - lambda_constant) * second_part
            if equation_score > score:
                score = equation_score
                phrase_to_add = i
        if phrase_to_add is None:
            phrase_to_add = r[0]
        r.remove(phrase_to_add)
        s.append((phrase_to_add, score))
    return s[:threshold_terms]

Setting $\lambda$ to 0.5 gives a balanced mix of diversity and accuracy in the result set. The value of $\lambda$ can be tuned based on the use case and your dataset.

MMR addresses the issue by ranking similar phrases far apart. Selecting the top N keyPhrases now works, because similar terms are no longer grouped together at the top and so don’t all appear in the final result.
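The effect of $\lambda$ is easiest to see on a toy example. The sketch below is a dependency-free version of the greedy MMR selection, with cosine similarity used for both similarity terms; the 2-dimensional vectors and the phrases are invented for illustration and are not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_rank(query_vec, phrase_vecs, lambda_constant=0.5):
    """Greedy MMR ordering of the phrases in phrase_vecs (a dict of phrase -> vector)."""
    remaining = list(phrase_vecs)
    selected = []
    while remaining:
        def mmr_score(phrase):
            relevance = cosine(query_vec, phrase_vecs[phrase])
            # Similarity to the closest already selected phrase; 0.0 while nothing is selected.
            redundancy = max(
                (cosine(phrase_vecs[phrase], phrase_vecs[s]) for s in selected), default=0.0
            )
            return lambda_constant * relevance - (1 - lambda_constant) * redundancy
        best = max(remaining, key=mmr_score)
        remaining.remove(best)
        selected.append(best)
    return selected

# Toy vectors: the two "product" phrases point in almost the same direction.
QUERY = [1.0, 1.0]
PHRASES = {
    "good product": [1.0, 0.9],
    "great product": [1.0, 0.8],
    "easy install": [0.2, 1.0],
}
balanced = mmr_rank(QUERY, PHRASES, lambda_constant=0.5)
relevance_only = mmr_rank(QUERY, PHRASES, lambda_constant=1.0)
print(balanced)
print(relevance_only)
```

With $\lambda$ = 1 the order is purely by relevance, so the two near-duplicate phrases sit next to each other; with $\lambda$ = 0.5 the dissimilar phrase moves up to second place.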

Please let me know if you like the post, or have some suggestions/concerns.

References:

Deep Averaging Network in Universal Sentence Encoder

2019-09-10T00:00:00+00:00

Word embeddings are now state of the art for downstream NLP tasks such as text classification, sentiment analysis, sentence similarity, etc. and provide very good results compared to tf-idf or count vectorizers. Using word embeddings we can find the similarity between words and apply vector operations, and can therefore easily distinguish between cat, dog, and car: cat and dog will be more similar to each other than to car.

But obtaining vectors for sentences is not immediately obvious. This post tries to explain one of the approaches described in Universal Sentence Encoder.

Deep averaging network (DAN): The idea of DAN is described in the paper Deep Unordered Composition Rivals Syntactic Methods for Text Classification.

Word embeddings are low-dimensional vectors in an N-dimensional space, each describing a word. To obtain a vector space model for sentences or documents, an appropriate composition function is required. A composition function is a mathematical process for combining multiple words into a single vector.

Composition functions are of two types

Unordered: Treats the input as a bag of word embeddings.
Syntactic: Takes word order and sentence structure into account.

Syntactic functions outperform unordered functions on many tasks, but at the same time they are computationally expensive and require more training time.

The deep unordered model, which obtains near state-of-the-art accuracy on sentence- and document-level tasks with very little training time, works in three steps:

Take the vector average of the embeddings associated with an input sequence of tokens
Pass that average through one or more feed-forward layers
Perform (linear) classification on the final layer’s representation

The loss function is cross entropy.
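The three steps and the loss can be sketched as a forward pass in plain Python. Everything numeric here is invented for illustration: the 3-dimensional embeddings, the weights of the single hidden layer, and the two output classes. A real DAN learns these values during training and uses far larger dimensions.

```python
import math
import random

# Toy embeddings and weights, invented for illustration; a real DAN learns them.
EMBEDDINGS = {
    "man": [0.5, 0.25, 1.0],
    "bites": [1.0, 0.5, 0.25],
    "dog": [0.25, 1.0, 0.5],
}
HIDDEN_WEIGHTS = [[0.5, -0.25, 0.75], [-0.5, 1.0, 0.25]]  # 2 hidden units x 3 inputs
OUTPUT_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]               # 2 classes x 2 hidden units

def average_embeddings(tokens, word_dropout=0.0, rng=random):
    """Step 1: average the token embeddings, optionally dropping whole tokens first."""
    kept = [t for t in tokens if rng.random() >= word_dropout] or tokens
    vectors = [EMBEDDINGS[t] for t in kept]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def dense(weights, vector, activation=None):
    """A feed-forward layer without bias: one weighted sum per row of weights."""
    out = [sum(w * x for w, x in zip(row, vector)) for row in weights]
    return [activation(o) for o in out] if activation else out

def softmax(logits):
    """Turn logits into class probabilities."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dan_forward(tokens, word_dropout=0.0):
    averaged = average_embeddings(tokens, word_dropout)             # step 1
    hidden = dense(HIDDEN_WEIGHTS, averaged, activation=math.tanh)  # step 2
    return softmax(dense(OUTPUT_WEIGHTS, hidden))                   # step 3

def cross_entropy(probs, true_class):
    """Cross-entropy loss for a single example."""
    return -math.log(probs[true_class])

p1 = dan_forward(["man", "bites", "dog"])
p2 = dan_forward(["dog", "bites", "man"])
print(p1, p2, cross_entropy(p1, true_class=0))
```

Because the first step is a plain average, the two sentences with the same words in a different order produce the same output in this sketch.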

Two important observations described in this paper are

Accuracy can be improved by using a variant of dropout that randomly drops some of the word embeddings before averaging, i.e. a dropout-inspired regularizer.
The choice of composition function is not as important as initializing with pre-trained embeddings and using a deep network.

Here the best of both approaches is taken, i.e. the training speed of unordered functions and the accuracy of syntactic functions. DAN takes very little training time, with slightly lower accuracy compared to the other approach, i.e. the transformer encoder.

Observations on Results:

Randomly dropping out 30% of the words from the vector average is optimal for the quiz bowl task and improves accuracy by 3%, which indicates that p = 0.3 is a good baseline to start with.
DANs achieve sentiment accuracy comparable to syntactic functions and train in far less time than syntactic functions such as RecNN.
2–3 layers achieve good results on the binary sentiment analysis task, and adding depth improves on the shallow neural bag-of-words model.
Word order is sometimes very important in NLP. "Man bites dog" and "Dog bites man" are two different sentences, but because we are just averaging the embeddings, the difference between them is lost.
DAN also performed poorly on double-negation sentences such as "this movie was not bad", whereas DRecNN is slightly better in terms of polarity.

I checked the similarity of the sentences "this is toy dog" and "this is dog toy". The DAN encodings of these two sentences should be the same, since they contain the same words and ordering should not matter, but it turns out that they are not the same.

One possible explanation is word dropout during averaging in the feed-forward pass of DAN.
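
For reference, the expectation that both sentences should encode identically does hold for plain averaging, which can be checked with a toy sketch. The embeddings below are made up and the helper names are hypothetical; this is not the Universal Sentence Encoder itself, only the averaging step in isolation.

```python
import math

# Made-up 3-dimensional embeddings, for illustration only.
EMBEDDINGS = {
    "this": [0.5, -1.0, 0.25],
    "is": [0.0, 0.5, 0.5],
    "toy": [1.0, 0.25, -0.5],
    "dog": [-0.75, 1.5, 0.0],
}


def average_embedding(sentence):
    """Plain (order-independent) average of the word embeddings."""
    vectors = [EMBEDDINGS[word] for word in sentence.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


first = average_embedding("this is toy dog")
second = average_embedding("this is dog toy")
similarity = cosine_similarity(first, second)  # identical averages, so this is ~1.0
```

Since the averaged input is identical, any deterministic feed-forward layers applied on top would also produce identical outputs, so a difference observed in practice has to come from something beyond plain averaging.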

Colab notebook can be accessed here.

References:

Universal Sentence Encoder
Deep Unordered Composition Rivals Syntactic Methods for Text Classification
https://github.com/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

Parameters to consider for Evaluation Metric to compare Forecasting results

2019-08-10T00:00:00+00:00

This post describes the issues that I faced in designing an evaluation metric for forecast results, the kinds of complexity associated with it, and how difficult it can be to come up with a single metric or number for comparing two forecasts over a period.

Consider a situation at a bank: every day you need to forecast cash withdrawals at an ATM so that, based on the demand and the availability of cash, you can schedule a trip to refill that ATM and avoid a cash-out.

This is a general problem that every bank faces and everyone has a solution of their own. But your organization uses a proprietary solution from some company that charges $X yearly as a licensing fee which is too much for that product.

To avoid the heavy charges and the dependency on this product, the company has decided to replace that system and wants to use open-source tools or libraries.

So far so good.

You start developing the solution for all ATMs using open-source libraries and tools. After some time, you have a solution that scales well in a distributed environment and provides forecasts.

Now the time has come to compare the forecast with the actuals and also with the existing solution, let's say on a month of data. These are some points that need to be considered:

Which metric will you compare the results on? Will it be RMSE (root mean square error), MAE (mean absolute error) or MAPE (mean absolute percentage error)?
On some days your open-source model will give good results in terms of the metric you have chosen, but on other days the existing solution will give better results. How can you say which one is better overall?
To check the robustness of the model, compare its results on special events like public holidays, new year, US public holidays, Diwali, Christmas, etc. These are the days on which abnormal patterns are generally observed, so they are a good chance to see how your model behaves in extreme events.
Can you divide the days into peak days, i.e. days on which your problem has a large impact on the business, and non-peak days, and check the performance of the model on each?
How many under-predictions and how many over-predictions are there for each model? Are over-predictions acceptable to the business, or are under-predictions acceptable for your problem?
If under-predictions or over-predictions are acceptable, then up to what magnitude?
What about the case where one or two very high or very low predictions push the whole month's metric (e.g. MAPE) very high, but once these outliers are removed from the comparison the forecast is close to the actuals?
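
Several of the points above can be made concrete with a small sketch. The numbers are made up for illustration and the helper names are my own; skipping zero-actual days in MAPE is an assumption of this example, not a recommendation.

```python
import math


def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    # Assumption: days with zero actual withdrawals are skipped to avoid dividing by zero.
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * sum(abs(f - a) / abs(a) for a, f in pairs) / len(pairs)


def over_under_counts(actual, forecast):
    """Number of days with over-prediction and with under-prediction."""
    over = sum(1 for a, f in zip(actual, forecast) if f > a)
    under = sum(1 for a, f in zip(actual, forecast) if f < a)
    return over, under


def daily_wins(actual, forecast_a, forecast_b):
    """Number of days on which each forecast is strictly closer to the actual."""
    wins_a = sum(1 for a, fa, fb in zip(actual, forecast_a, forecast_b) if abs(fa - a) < abs(fb - a))
    wins_b = sum(1 for a, fa, fb in zip(actual, forecast_a, forecast_b) if abs(fb - a) < abs(fa - a))
    return wins_a, wins_b


# Made-up daily withdrawals for one ATM and two competing forecasts.
actual = [100, 120, 80, 200, 150]
new_model = [110, 115, 85, 150, 155]
existing = [90, 130, 70, 210, 140]

print("MAE  new/existing:", mae(actual, new_model), mae(actual, existing))
print("RMSE new/existing:", rmse(actual, new_model), rmse(actual, existing))
print("MAPE new/existing:", mape(actual, new_model), mape(actual, existing))
print("over/under (new):", over_under_counts(actual, new_model))
print("daily wins (new, existing):", daily_wins(actual, new_model, existing))

# Effect of a single outlier day (index 3) on the new model's MAE.
keep = [i for i in range(len(actual)) if i != 3]
mae_without_outlier = mae([actual[i] for i in keep], [new_model[i] for i in keep])
```

In this toy data the new model is closer to the actuals on more days, yet its MAE over the whole period is worse because of one large miss. Dropping that single day reverses the comparison, which is exactly the kind of conflict between criteria described in the list above.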

Suppose you come up with answers to all of the above questions. How do you then come up with one metric that combines all of these points?

Remember, we are talking about only one ATM here. What happens once we consider, let's say, thousands of machines? And what about the different denominations, like $5, $10 and $100, that a machine can hold?

Put yourself in the position of a business person who has the authority to decommission the existing solution and start using the solution that you developed. Before making any decision, you want to check how the new solution behaves overall compared to the actuals and to the existing solution, because very high risk is involved if you make a decision without considering all these factors.

So the point is that in this type of scenario all of these issues need to be considered, and it is very hard to judge the quality of a forecast from only a few of the above points. But judging the quality of the forecast becomes even more difficult once you start considering all of them.