This post describes the issues that I have faced in designing the evaluation metric for forecast results, the type of complexity associated with it and how difficult it can be to come up with a single metric or number for comparing two forecasts over a period.
Consider a situation in the bank, every day you need to forecast cash withdrawal in ATM so that based on the demand and availability of cash you can schedule a trip and fill that ATM to avoid cash-out.
This is a general problem that every bank faces and everyone has a solution of their own. But your organization uses a proprietary solution from some company that charges $X yearly as a licensing fee which is too much for that product.
To avoid heavy charges and dependency on this product, the company decided to replace that system and want to use open source tools or libraries.
So far so good.
You started developing the solution for all ATMs using open source libraries and tools. And after some time you developed the solution which scale well to distributed environment and provides forecast.
Now the time has come to compare the forecast with actual and also with existing solutions lets say on for month data. These are some points that need to be considered
- What is the metric you will compare the results on? will it be RMSE(root means square error), MAE (mean absolute error)or MAPE (mean absolute percentage error)?
- Now for some days, your open-source model will give good results in terms of metrics that you might have chosen but at the same time, the existing solution will give better results for some days. How collectively you can say which one is better?
- To check the robustness of the model, compare the results of the model on special events like public holidays, new year, US public holidays, Diwali, Christmas, etc. Because these are the days where abnormal patterns are generally observed and it is a good chance to see how your model behaves in these extreme events.
- Can you divide the days into peak days i.e. where your problem has much impact on business and non-peak days and check the performance of the model?
- How many under predictions and how many over predictions are there for each model? Are over predictions are acceptable by a business or under predictions at acceptable to your problem?
- If under prediction/ over-prediction are acceptable, then by what magnitude?
- What about the case where due to one or two very high or very low prediction whole month metric (i.e. MAPE) goes very high, but if you remove these outliers from comparisons forecast is close to actual.
Suppose you come up with all the answers to the above questions, then how to come up with one metric that combines all the above points?
Remember we are talking this only for 1 ATM machine. What about once we will consider let say 1000s of machines? And how about different denominations like $5, $10 and $100 that a machine can have.
Consider yourself as a business person who has the authority to take the decision to decommission the existing solution and start using the solution that you developed. Before taking any decision you want to check how the overall new solution is behaving when compared to actual and existing solution because very high risk is involved if you make a decision without considering all these factors.
So the point here is in this type of scenario all these types of issues need to be considered and it is very hard to say about the quality of forecast considering only a few of the above points. But again it becomes even more difficult once you start considering all the points to say about the quality of the forecast.
Comments