Evaluation Metrics for Recommendation Systems

A summary of research on performance evaluation of recommendation systems.

Classification and Usage

Regression Accuracy

Used for predicting user ratings
- Predicting ratings on scales of 1~5 or 1~10
Can also be solved as a classification problem, but since there is ordinality, it can be solved as a regression problem
- Predicting continuous values is more useful for subsequent sorting, etc.

Classification Accuracy (Mainly Binary Classification)

Used for predicting whether a user has rated, purchased, or viewed something
- Predicting 0 or 1, like or dislike, did or didn't

List Relevance

Used for evaluating recommended lists
Older metrics that consider list ordering but not position
- Consider evaluating a recommendation of 5 versus 4
- In list-based recommendations, items with lower ranks should be weighted relatively higher, but the metrics introduced here cannot make this distinction
- The "Ranked List Relevance" metrics introduced next can distinguish between them (the former scores worse, the latter scores better)

Ranked List Relevance

Used for evaluating recommended lists
Newer metrics that consider both list ordering and position
Impose large penalties for ordering errors at the top of the list

Others

Newer evaluation metrics used for evaluating recommendation lists and the overall recommendation system
- Item/user coverage rate
- Recommending new items, penalizing recommendations that are not purchased, etc.
The challenge is that data for evaluation is difficult to obtain

Regression Accuracy

Used for evaluating ratings
Whether the system has the ability to accurately predict user ratings
The difference between evaluation values in test data and predicted evaluation values
In evaluation browsing tasks (such as sorting by rating), users see the evaluation values themselves
- In such cases, emphasis should be placed on the prediction accuracy of evaluation values

MAE (Mean Absolute Error)

\text{MAE} = \frac{1}{N} \sum^{N}_{i=1} |\hat{r_{i}} - r_{i}|

MSE (Mean Squared Error)

\text{MSE} = \frac{1}{N} \sum^{N}_{i=1} (\hat{r_{i}} - r_{i})^{2}

$MAE@{k}$ , $RMSE@{k}$

In practice, only the top few items with the highest predicted values are shown to users
Calculate MAE and RMSE for the top K items (TopK) for evaluation

\text{MAE@k} = \frac{1}{K} \sum^{K}_{i=1} \text{MAE}_{i}

Classification Accuracy (Mainly Binary Classification)

Used for binary evaluation such as whether a click or purchase occurred

Accuracy (Acc)

The proportion of cases where the user's interest matched between the predicted result and the test data
For evaluation values, whether the rating is high or low, etc.

Precision, Recall, F-measure, ROC

Precision = \frac{\text{Number of relevant items in the recommendation}}{\text{Total number of items in the recommendation}}

Recall = \frac{\text{Number of relevant items in the search results}}{\text{Total number of relevant items}}

For relevant item discovery tasks where finding one relevant item is sufficient, precision is important
- Measures how few recommendation errors there are
- The proportion of actually relevant items among the top n recommended items
For relevant item enumeration tasks where all relevant items need to be found, recall is emphasized
- Measures how few recommendation omissions there are
- The proportion of relevant items that are included in the top n recommendations
For comprehensive judgment, use E-measure, F-measure, or ROC

F = \frac{1}{\frac{\alpha}{Precision}+\frac{1-\alpha}{Recall}}

When alpha = 0.5, this is the F1 score

n-points Interpolated Precision

Takes the average of Precision at N recall points
Typically, N=11 (recall 0.0, 0.1, 0.2, ..., 0.9, 1.0)

nAIP = \frac{1}{N}\sum_{i}\text{InterPrecision}_{i}

ROC AUC

\text{AUC}(f) = \frac{\sum_{t_{0} \in D^{0}} \sum_{t_{1} \in D^{1}} 1 \{ f(t_{0}) < f(t_{1}) \}}{|D^{0}| |D^{1}|}

where

$|f(t_{0})\lt f(t_{1})|$ takes 1 if $f(t0) \lt f(t1)$ , and 0 otherwise
D0 is the set of negative samples
D1 is the set of positive samples

Hit-rate

A metric for targeted unary data (for binary data)

Search for non-zero entries in the user's test data
Check if the item is included in the recommendation list
Calculate the hit-rate

\text{hit-rate} = \frac{\text{num}_{hits}}{n}

$\text{num}_{hits}$ : Number of items included in the recommendation list

$n$ : Total number of items across all users

List Relevance Metrics

Relatively older ranking metrics
Ordering is considered, but the rank position itself is often not considered

Rank Correlation

Spearman's Rank Correlation

r_{\text{Spearman}} = 1 - \frac{6}{n(n^{2} - 1)} \sum^{n}_{i=1} (x_{i} - y_{i})^{2}

$x_i$ : Rank of item i as ranked by the recommendation system

$y_i$ : Rank of item i as ranked by the user

The coefficient $\frac{6}{n(n^{2} - 1)}$ can be derived by manipulating the bivariate normal distribution.

Equals 1 if ranked correctly
The magnitude of the values does not matter

Kendall's Rank

Checks agreement between users regarding the magnitude relationship between two items
Total number of pairs is $n(n-1)/2$
If "User S's rating of A < User S's rating of B" AND "User T's rating of A < User T's rating of B", then P += 1
If "User S's rating of A > User S's rating of B" AND "User T's rating of A > User T's rating of B", then P += 1
Otherwise, Q += 1, and the following is calculated:

\tau = \frac{P - Q}{\frac{1}{2} n (n - 1)} = \frac{2P}{\frac{1}{2} n (n - 1)} - 1

NDMP (Normalized Distance-based Performance Measure)

C-: Number of contradictory preference relationships
- Cases where the user rates item A > item B, but the recommendation ranks item B > item A
- D-E in the table below
Cu: Number of identical preference relationships between the recommendation and the user
- B-C, etc. in the table below
Ci: Total number of preference relationships for item pairs between the recommendation and user rankings

NDMP = \frac{2C^{-}+C^{u}}{2C^{i}}

Rank	Recommend	User
1	A	A
2	B	B
3	C	C
4	D	E
5	E	D

\text{NDMP for the table above} = \frac{2 \cdot 1+9}{2 \cdot 10}=\frac{11}{20}=0.55

Evaluation Metrics for Ranked Recommendations

Evaluates how well the recommended items are ordered
Whether the N most relevant items are predicted in the correct order
When the goal is relevant item discovery, the recommendation system is expected to support decision-making
- In such cases, the magnitude of evaluation should be prioritized over the numerical value
It is difficult to obtain a complete ranking from the active user
The main issue here is the weighting of the ordering

AP (Average Precision)

In a ranked recommendation, precision is calculated each time a relevant item is found, and the arithmetic mean is taken

AP = \frac{1}{R}\sum^{n}_{r=1}Prec(r)I(r)

Here, Prec(r) is Precision, and I(r) is a function that returns 1 if the item at rank r is relevant, and 0 otherwise
However, this is simply a binary relevance-based evaluation
In practice, evaluation metrics based on partial or multi-valued relevance are used

MAP (Mean Average Precision)

For $MAP@{k}$ , the average of the top K APs is taken

MAP = \frac{1}{N}\sum^{N}_{i=1}AP_{i}

MRR

The average of the reciprocal rank of the first relevant item found

Create recommendation lists for all target users
Find at what position the first relevant item appears
Take the reciprocal of this rank (if the list contains no correct answer, use 0 instead of taking the reciprocal)
Average across all users

MRR = \frac{1}{N}\sum^{N}_{i=1}\frac{1}{\text{rank of the relevant item for user i}}

Differences in rank at lower positions have little impact on the metric

CG (Cumulative Gain, without discounting)

The prototype of DCG
Sums up the gain (score) of correctly identified items

CG = \sum^{n}_{r=1}g(r)

nDCG (Multi-valued Relevance-based Evaluation Metric)

When accumulating the number of relevant items, items ranked lower receive more discounting
- Multiply by coefficients that are larger for higher ranks and smaller for lower ranks, then sum
- This coefficient is called the discounted gain
A normalized value of the cumulative discounted gain, where gain represents value based on relevance, and the value decreases as rank gets lower
DCG (Discounted Cumulative Gain) is defined below
g(r) is the gain at rank r
- Uses the user's actual evaluation values
- Can also be calculated with arbitrary weighting such as 5 points for 1st, 4 points for 2nd, 3 points for 3rd, 2 points for 4th, etc.

DCG = \sum^{n}_{r=1}\frac{g(r)}{log_{2}(r+1)}

IDCG is the discounted gain for an ideal recommendation
The normalized nDCG is as follows:

nDCG = \frac{DCG}{IDCG}

ARHR (Average Reciprocal Hit Rank)

For binary data

ARHR = \frac{1}{n}\sum_{i \in Z_{l}}\frac{1}{pos(i)}

$Z_{l}$ : Non-zero items in the user's recommendation list

pos: Position of the item in the ranking

Half-life Utility Metric

Reduces the importance of items by order using a half-life parameter

R_{u} = \sum_{j}\frac{max(r_{u,j}-\text{default}, 0)}{2^{(j-1)/(\alpha -1)}}

$r\_{u, j}$ : The user u's actual rating of the item at rank j

default: Default evaluation value (usually the mean of evaluation values)

$\alpha$ : Parameter controlling the half-life

RBP (Rank Biased Precision)

Incorporates a user model

RBP = (1-p)\sum^{n}_{i=1}g_{i}p^{i-1}

$g_i$ : Degree of relevance of the item at rank i to the user's preference

p: Parameter modeling user persistence while browsing the ranked list (usually estimated from click logs)

ERR (Expected Reciprocal Rank)

Cascade user model
- A model that considers the relevance of items higher in the list
- Users typically stop searching once satisfied
- In this case, everything below the stopping point is not clicked
One of the cascade models
Sums the probability that the user stops browsing the recommended ranking at position r
This probability is influenced by higher-ranked positions

\begin{aligned} \text{ERR} &= \sum^{n}_{r=1} \zeta(r) P(\text{user stops at position } r) \\ \zeta(r) &= \frac{1}{r} \quad \text{or} \quad \zeta(r) = \frac{1}{\log_{2}(r+1)} \end{aligned}

Or alternatively:

\begin{aligned} \text{ERR} &= \sum^{n}_{r=1} \zeta(r) \prod^{r-1}_{i=1} (1 - R_{i}) R_{r} \\ R_{i} &= \frac{2^{g} - 1}{2^{g_{\text{max}}}} \end{aligned}

g: Degree of relevance to the user's preference

$g_{max}$ : Maximum grade on the scale (e.g., 5 on a 1-5 rating scale)

Q-measure, GAP (Graded Average Precision)

An extension of average precision

Non-accuracy Evaluation Metrics

Metrics that are useful to users but not exploration-oriented
- Coverage, learning rate, confidence, reliability
Exploration-oriented metrics useful to users
- Measuring whether recommendations are novel to the user
- Serendipity metrics
- Novelty metrics
- Diversity metrics

Useful Non-exploration Metrics

Coverage

Evaluates how well the recommendation content covers users and items
- In collaborative filtering, predictions cannot be made for items that have not been rated by users
- In content-based filtering, items with missing features cannot be predicted
There is a trade-off with prediction accuracy
- Recommending items that most customers purchase (such as milk and eggs) increases prediction accuracy but reduces coverage

Catalog Coverage

How many items can be recommended in a single recommendation out of all available items
Indicates the breadth of recommended items

\begin{aligned} \text{Catalogue Coverage} &= \frac{|S_{r}|}{|S_{a}|} \\ \text{where} \ \ S_{a}&\text{: all items} \\ S_{r}&\text{: items that can be recommended} \end{aligned}

Relevant catalog coverage, which limits the scope of recommendations to relevant items only:

\text{Weighted Catalogue Coverage} = \frac{|S_{r} \cap S_{s}|}{|S_{a}|}

User Coverage

Indicates how many users out of all target users received recommendations in the most recent recommendation cycle

\begin{aligned} \text{Prediction Coverage} &= \frac{|S_{p}|}{|S_{u}|} \\ \text{where} \ \ S_{p}&\text{: users for whom the system could recommend at least one item} \\ S_{u}&\text{: all users} \end{aligned}

Learning Rate

How quickly the recommendation system can provide appropriate recommendations after a user's preferences change
- Related to the cold start problem
- If the system cannot quickly provide appropriate recommendations, user satisfaction decreases
- Time elapsed after preference change
- It is necessary to continuously monitor accuracy from the time a user starts receiving recommendations

Confidence

How confident the system is in its recommendations
- Calculate how many users and items were used to build the recommendation system
  - Data coverage rate at system construction time
- Calculate the similarity of K-NN used in collaborative filtering predictions
  - When similarity is low, the K-NN distance becomes large, so this should be monitored

Reliability

User trust in the recommendation system
The recommendation system must demonstrate the ability to predict user preferences and interests
- Difficult to gain trust if items familiar to the user are not shown
- If users cannot trust the recommendation system, they will stop using it
- Trust increases when recommendation results are good
- Directly ask users through online evaluation
- Estimate through user usage frequency

Exploration-oriented Metrics

If only similar items are recommended, users will get bored with the recommendations
Recommending already-known items does not help with decision-making
Evaluation targets: user sets, item sets, item pairs, etc.
Required data: rating matrix (users x items), ontology (item categories), other novelty/serendipity-related datasets

Cautions

Recommending only novel items reduces user satisfaction
- Users generally prefer known items
- When evaluated comprehensively, novelty decreases user satisfaction
- Caution is needed when recommending novel items
Consider the user's experience with the recommendation system
- Explain the novelty of the recommendation system in advance
- Explain each recommended novel item

Serendipity

Whether the recommendation surprises the user
- (The user cannot search for the item on their own)
- Surprise when a specific item is recommended
- Calculated using other systems?

Unexpectedness

\begin{aligned} \text{EXEXP} &= \frac{1}{|L_{i}|} \sum^{|L_i|}_{i=1} \max(P(s_i) - \text{prim}(s_i), 0) \cdot \text{rel}(s_i) \\ \text{where} \ \ s_i&\text{: the i-th item} \\ L_i&\text{: the evaluated recommendation list} \\ \text{rel}(s_i)&\text{: user's rating of item }s_i \\ P(s_i)&\text{: predicted rating by the target system} \\ \text{prim}(s_i)&\text{: predicted rating by the primitive system} \end{aligned}

\begin{aligned} \text{UNEXP} &= \frac{\text{RS}}{\text{PM}} \\ \text{SRDP} &= \frac{|\text{UNEXP} \cap \text{USEFUL}|}{N} \end{aligned}

where

\begin{aligned} \text{RS}&\text{: set of items by the recommendation system} \\ \text{PM}&\text{: set of items by the primitive system} \\ \text{UNEXP}&\text{: set of unexpected items} \\ \text{USEFUL}&\text{: set of useful items} \end{aligned}

Entropy-based Diversity

Whether items recommended by one system are also recommended by other recommendation systems

\begin{aligned} \text{div}_{a,v} &= -\sum_{i \in L_{a,u \cap R_{u}}} p_{u,i} \log p_{u,i} \\ p_{u,i} &= \frac{\sum_{a \in A} \delta(a,u,i)}{|A|} \\ \text{where} \ \ A&\text{: set of recommendations} \\ a&\text{: evaluation by the target system} \\ L_{a,u}&\text{: recommendation list provided by system }a\text{ for user }u \\ R_u&\text{: items relevant to user }u \\ \delta&\text{: } 1 \text{ if } i \in L_{a,u} \cap R_u \text{ and } 0 \text{ otherwise} \end{aligned}

Novelty

Whether the recommended item is unknown to the user
Discovery ratio, Precision of novelty, Item novelty, Temporal novelty, Novelty based on HLU, Long tail metric, Generalized novelty model, Unexpectedness, Entropy-based diversity, Unserendipity, HLU of serendipity

How to Measure

Ask about the user's relationship with a specific item
Calculate general popularity or similarity between items
In the user x item rating matrix, values become 0 or 1 (known or unknown)

Discovery Ratio

How many items in the recommendation list are unknown to the user

\begin{aligned} \text{Discovery} &= \frac{|D_{i} \cap L_{i}|}{|L_{i}|} \\ \text{where} \ \ D_{i}&\text{: list of items unknown to user }i \\ L_{i}&\text{: recommendation list for user }i \end{aligned}

Precision of Novelty

\begin{aligned} \text{Precision(novelty)} &= \frac{|C_{i} \cap L_{i}|}{|L_{i}|} \\ \text{Recall(novelty)} &= \frac{|C_{i} \cap L_{i}|}{|C_{i}|} \\ \text{where} \ \ C_{i}&\text{: list of items unknown to and liked by user }i \\ L_{i}&\text{: recommendation list for user }i \end{aligned}

Item Novelty

Measures the novelty of recommended items
- Uses diversity within the list

\begin{aligned} \text{Nov}_{\text{item}}(i) &= N(\text{Div}(L) - \text{Div}(L - \{i\})) \\ &= \frac{1}{N-1} \sum_{j \in L} d(i,j) \\ \text{where} \ \ |L| &= N \\ d(i,j)&\text{: function measuring distance between items} \\ \text{Div}()&\text{: function measuring diversity of the recommendation list} \end{aligned}

Temporal Novelty

Measures whether recommendations differ from past recommendation lists

\begin{aligned} \text{Nov}_{\text{tmp}}(L_{i}) &= \frac{| \{ x \in L_{i} \mid x \notin S_{\text{past}} \} |}{|L_{i}|} \\ \text{where} \ \ S_{\text{past}} &= \bigcup^{i-1}_{j=1} L_{j} \end{aligned}

Half-life Based Novelty

Decreases the gain of an item's rating from the normal rating based on a parameter
- Items that are recommended but not purchased have their evaluation discounted
Half-life evaluation function

\begin{aligned} R_{u} &= \sum_{j} \frac{\max(r_{u,j} - \text{default}, 0)}{2^{(j-1)/(\alpha-1)}} \\ \text{where} \ \ r_{u,j}&\text{: item rated }j\text{ by user }u \\ \text{default}&\text{: default evaluation value (usually the mean)} \\ \alpha&\text{: half-life parameter} \end{aligned}

NHLU

\begin{aligned} f(i) &= \log_{2}\left(\frac{n}{n_{i}}\right) \\ \text{NHLU}_{u} &= \sum^{N}_{i} f(i) \frac{\max(r_{u,i} - \text{default}, 0)}{2^{(i-1)/(\alpha-1)}} \\ \text{where} \ \ n&\text{: number of users} \\ n_i&\text{: number of users who rated item }i \end{aligned}

Long-tail Based Novelty

Categorize items into three categories (HEAD, MID, TAIL) based on popularity
Extract items $a_i$ and $a_j$ from the top N recommendations
Evaluate the ability to recommend novel items using the following evaluation matrix

$ai→aj$	HEAD	MID	TAIL
HEAD	45%	55%	0%
MID	5%	70%	25%
TAIL	0%	15%	85%

If HEAD→HEAD is large, novelty is low
An indicator of whether popular items are being rotated

Diversity

Whether various types of items can be recommended to users
Aggregate diversity, Inter-user diversity, List personalization metric, Gini coefficient, Temporal diversity, Intra-list similarity, Subtopic retrieval, MMR, alpha-nDCG

Aggregate Diversity

Diversity of the recommendation system itself
Aggregates the types of items recommended to all users
A high value indicates that the recommendation system provides different items to users

\text{Aggdiv} = \left| \bigcup_{u \in U} L_{u} \right|

where

\begin{aligned} U&\text{: set of users} \\ L_{u}&\text{: list of items recommended to user }u \end{aligned}

Some research uses the total item set as the denominator to measure how many items out of all items were recommended to users
By replacing item types with subcategories/subtopics, category/topic diversity can also be measured (Subtopic retrieval metric)

Inter-user Diversity

Diversity across users
A mechanism for providing different recommendations to different users
How well the recommendation system is personalized for individual users (IUD)

\begin{aligned} d_{u,v} &= 1 - \frac{|L_{u} \cap L_{v}|}{N} \\ \text{IUD} &= \frac{1}{N_{U}} \sum_{u,v \in U} d_{u,v} \\ \text{where} \ \ U&\text{: user set} \\ N &= |L_{u}| = |L_{v}| \\ N_{U}&\text{: number of pairs in }U \\ L_{u}&\text{: list of items recommended to user }u \end{aligned}

List Personalization Metric

Whether the list is optimized for individuals

\begin{aligned} p_{b} &= \frac{U_{b}}{U} \\ I_{b} &= \log_{2}\left(\frac{|U|}{U_{b}}\right) \\ \text{Per}(L_{i}) &= \frac{\sum_{b_{j} \in L_{i}} \log \frac{|U|}{|U_{b}|}}{|L_{i}|} \\ \text{where} \ \ p_{b}&\text{: probability that item }b\text{ is selected by a user} \\ I_{b}&\text{: self-entropy of item }b \\ \text{Per}(L_{i})&\text{: degree of personalization of items} \\ U_{b}&\text{: number of users who selected item }b \end{aligned}

Gini Coefficient

Used as an indicator of income inequality, etc.
How biased the recommendation frequency of items is
x-axis: Percentage x% of items arranged in order of total frequency in the recommendation list
y-axis: Total frequency of x% of items in the list

\begin{aligned} G &= \frac{B}{A + B} \\ \text{where} \ \ A&\text{: area between the 45-degree line and the curve when graphed} \\ B&\text{: remaining area (area under the curve)} \end{aligned}

Temporal Diversity

Measurement when the system outputs different items at different times

\text{div}_{\text{tmp}}(L_{1}, L_{2}, N) = \frac{| \{ x \in L_{2} \mid x \notin L_{1} \} |}{N}

where $N = |L_{1}| = |L_{2}|$

Intra-list Diversity

Diversity based on the content of the recommendation list
Calculated using the similarity between two items
- Typically, feature vectors or categories are used

\text{Diversity} = \sum_{l \in L_{i}, m \in L_{i}} \frac{1}{\text{similarity}(l,m)}

where $L_{i}$ : list of items recommended to user $i$

\text{ILS}(L_{i}) = \frac{\sum_{b_{k} \in L_{i}} \sum_{b_{e} \in L_{i}, b_{k} \not= b_{e}} \text{similarity}(b_{k}, b_{e})}{\binom{|L_{i}|}{2}}

MMR (Maximal Marginal Relevance)

Originally a document selection method for search results
This concept is applied to evaluate recommendation result lists

\text{MMR}_{\text{List}} = \sum_{d_{i} \in L} (\alpha \cdot \text{rel}(d_{i}) - (1 - \alpha) \max_{d_{j} \in L^{i-1}} \text{sim}(d_{i}, d_{j}))

where

\begin{aligned} L&\text{: recommendation list} \\ d_{i}&\text{: item in the list} \\ \text{rel}&\text{: whether the user likes the item} \\ L^{i-1}&\text{: recommendation list up to the }j\text{-th item} \end{aligned}

Notes on Recommendation System Evaluation Metrics

AP and nDCG can evaluate ranked recommendations for specific users
When evaluating a system, it is necessary to evaluate the average quality across multiple users

MAP

The arithmetic mean of AP over a set of N users

\text{MAP} = \frac{1}{N} \sum^{N}_{i=1} \text{AP}_{i}

MAP alone is insufficient when the goal is to present reasonably relevant recommendations to as many users as possible

GMAP (Geometric Mean AP)

\text{GMAP} = \sqrt[n]{\prod_{n} \text{AP}_{n}}

For convenience, this can be transformed into the following form:

Where epsilon is a very small value (around 10^(-5)):

\text{GMAP} = \exp\left(\frac{1}{N} \sum^{N}_{i=1} \ln(\text{AP}_{i} + \epsilon)\right) - \epsilon

When recommendations for some users contain no relevant items, GMAP evaluates lower than MAP.

When the measurement range is short (e.g., top 10), MAP and GMAP values tend to be unstable

infAP (Inferred Average Precision)

Estimates the true average precision from incomplete test collections

Policy-based Evaluation Metrics

Distribution of specific item attributes
- When there are categories to promote, evaluation can be based on the distribution of specific attributes of the provided items
Frequency of NG item appearances
- The goal is basically 0
Transition rate to target items/categories
- Measured when controlling where users navigate to
Gini coefficient of page views per item
- Measured to see if bias is corrected when introducing mechanisms to prevent item view bias by the system

Interleaving

Alternately present new and old algorithms and evaluate which one receives better responses
https://www.wantedly.com/companies/wantedly/post_articles/155593
Can evaluate significant differences between algorithms more than 10 times faster than A/B testing
- Useful when the impact is small

Industry Status in 2017 (RecSys 2017 Conference)

RecSys_2017_Conference

At the recommendation systems conference, the metrics used by companies as of 2017, in order of frequency:

MRR (ranked recommendation list metric considering ordering, weighted by reciprocal of rank)
AUC (classification problem)
Precision (finding one relevant item)
nDCG (ranked recommendation list metric considering ordering, with adjustable weights)
Recall (enumerating all relevant items)

Although exploration-oriented evaluation metrics have been proposed in papers, in real environments the answer data is difficult to obtain, and they are not widely used in practice yet.
MRR and nDCG are relatively newer methods but are widely used.

References

About recommendation systems: http://www.kamishima.net/archive/recsysdoc.pdf
About evaluation metrics: https://techblog.nhn-techorus.com/archives/4778
Evaluation methods and metrics for search/recommendation: https://qiita.com/rilmayer/items/4c2c40640c0892e13257
Classification of evaluation metrics: https://blog.brainpad.co.jp/entry/2017/08/25/140000
Offline evaluation of recommendation systems: http://soc-research.org/wp-content/uploads/2014/11/OfflineTest4RS.pdf

Classification and Usage​

Regression Accuracy​

Classification Accuracy (Mainly Binary Classification)​

List Relevance​

Ranked List Relevance​

Others​

Regression Accuracy​

MAE (Mean Absolute Error)​

MSE (Mean Squared Error)​

MAE@kMAE@{k}MAE@k, RMSE@kRMSE@{k}RMSE@k​

Classification Accuracy (Mainly Binary Classification)​

Accuracy (Acc)​

Precision, Recall, F-measure, ROC​

n-points Interpolated Precision​

ROC AUC​

Hit-rate​

List Relevance Metrics​

Rank Correlation​

Spearman's Rank Correlation​

Kendall's Rank​

NDMP (Normalized Distance-based Performance Measure)​

Evaluation Metrics for Ranked Recommendations​

AP (Average Precision)​

MAP (Mean Average Precision)​

MRR​

CG (Cumulative Gain, without discounting)​

nDCG (Multi-valued Relevance-based Evaluation Metric)​

ARHR (Average Reciprocal Hit Rank)​

Half-life Utility Metric​

RBP (Rank Biased Precision)​

ERR (Expected Reciprocal Rank)​

Q-measure, GAP (Graded Average Precision)​

Non-accuracy Evaluation Metrics​

Useful Non-exploration Metrics​

Coverage​

Catalog Coverage​

User Coverage​

Learning Rate​

Confidence​

Reliability​

Exploration-oriented Metrics​

Cautions​

Serendipity​

Unexpectedness​

Entropy-based Diversity​

Novelty​

How to Measure​

Discovery Ratio​

Precision of Novelty​

Item Novelty​

Temporal Novelty​

Half-life Based Novelty​

Long-tail Based Novelty​

Diversity​

Aggregate Diversity​

Inter-user Diversity​

List Personalization Metric​

Gini Coefficient​

Temporal Diversity​

Intra-list Diversity​

MMR (Maximal Marginal Relevance)​

Notes on Recommendation System Evaluation Metrics​

MAP​

GMAP (Geometric Mean AP)​

infAP (Inferred Average Precision)​

Policy-based Evaluation Metrics​

Interleaving​

Industry Status in 2017 (RecSys 2017 Conference)​

References​

Classification and Usage

Regression Accuracy

Classification Accuracy (Mainly Binary Classification)

List Relevance

Ranked List Relevance

Others

Regression Accuracy

MAE (Mean Absolute Error)

MSE (Mean Squared Error)

$MAE@{k}$ , $RMSE@{k}$

Classification Accuracy (Mainly Binary Classification)

Accuracy (Acc)

Precision, Recall, F-measure, ROC

n-points Interpolated Precision

ROC AUC

Hit-rate

List Relevance Metrics

Rank Correlation

Spearman's Rank Correlation

Kendall's Rank

NDMP (Normalized Distance-based Performance Measure)

Evaluation Metrics for Ranked Recommendations

AP (Average Precision)

MAP (Mean Average Precision)

MRR

CG (Cumulative Gain, without discounting)

nDCG (Multi-valued Relevance-based Evaluation Metric)

ARHR (Average Reciprocal Hit Rank)

Half-life Utility Metric

RBP (Rank Biased Precision)

ERR (Expected Reciprocal Rank)

Q-measure, GAP (Graded Average Precision)

Non-accuracy Evaluation Metrics

Useful Non-exploration Metrics

Coverage

Catalog Coverage

User Coverage

Learning Rate

Confidence

Reliability

Exploration-oriented Metrics

Cautions

Serendipity

Unexpectedness

Entropy-based Diversity

Novelty

How to Measure

Discovery Ratio

Precision of Novelty

Item Novelty

Temporal Novelty

Half-life Based Novelty

Long-tail Based Novelty

Diversity

Aggregate Diversity

Inter-user Diversity

List Personalization Metric

Gini Coefficient

Temporal Diversity

Intra-list Diversity

MMR (Maximal Marginal Relevance)

Notes on Recommendation System Evaluation Metrics

MAP

GMAP (Geometric Mean AP)

infAP (Inferred Average Precision)

Policy-based Evaluation Metrics

Interleaving

Industry Status in 2017 (RecSys 2017 Conference)

References