
Evaluation Metrics for Recommendation Systems

A summary of research on performance evaluation of recommendation systems.

Classification and Usage

Regression Accuracy

  • Used for predicting user ratings
    • Predicting ratings on scales of 1~5 or 1~10
  • Could also be framed as a classification problem, but because ratings are ordinal it is usually treated as regression
    • Predicting continuous values is also more useful for downstream steps such as sorting

Classification Accuracy (Mainly Binary Classification)

  • Used for predicting whether a user has rated, purchased, or viewed something
    • Predicting 0 or 1, like or dislike, did or didn't

List Relevance

  • Used for evaluating recommended lists
  • Older metrics that consider whether the ordering agrees, but not where in the list disagreements occur
    • Consider a list where the items at ranks 4 and 5 are swapped versus one where the items at ranks 1 and 2 are swapped
    • Errors near the top of the list should be penalized more heavily, but the metrics introduced here cannot make this distinction
    • The "Ranked List Relevance" metrics introduced next can distinguish between them (the top-of-list swap scores worse, the bottom-of-list swap scores better)

Ranked List Relevance

  • Used for evaluating recommended lists
  • Newer metrics that consider both list ordering and position
  • Impose large penalties for ordering errors at the top of the list

Others

  • Newer evaluation metrics used for evaluating recommendation lists and the overall recommendation system
    • Item/user coverage rate
    • Recommending new items, penalizing recommendations that are not purchased, etc.
  • The challenge is that data for evaluation is difficult to obtain

Regression Accuracy

  • Used for evaluating ratings

  • Whether the system has the ability to accurately predict user ratings

  • Measures the difference between the ratings in the test data and the predicted ratings

  • In rating-browsing tasks (such as sorting items by predicted rating), users see the predicted values themselves

    • In such cases, emphasis should be placed on the prediction accuracy of the rating values

MAE (Mean Absolute Error)

\text{MAE} = \frac{1}{N} \sum^{N}_{i=1} |\hat{r}_{i} - r_{i}|

MSE (Mean Squared Error)

\text{MSE} = \frac{1}{N} \sum^{N}_{i=1} (\hat{r}_{i} - r_{i})^{2}

MAE@k, RMSE@k

  • In practice, only the top few items with the highest predicted values are shown to users
  • Calculate MAE and RMSE for the top K items (TopK) for evaluation
\text{MAE@k} = \frac{1}{K} \sum^{K}_{i=1} \text{MAE}_{i}
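As a minimal sketch of these regression metrics (plain-Python, list-based inputs; function names are my own):

```python
import math

def mae(preds, actuals):
    # Mean absolute error between predicted and true ratings
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)

def rmse(preds, actuals):
    # Root of the mean squared error; penalizes large errors more than MAE
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds))

def mae_at_k(preds, actuals, k):
    # Restrict the error to the k items with the highest predicted ratings,
    # since only those are actually shown to the user
    top = sorted(zip(preds, actuals), key=lambda t: t[0], reverse=True)[:k]
    return sum(abs(p - a) for p, a in top) / k
```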

Classification Accuracy (Mainly Binary Classification)

  • Used for binary evaluation such as whether a click or purchase occurred

Accuracy (Acc)

  • The proportion of cases where the predicted result matches the user's actual interest in the test data

  • For ratings, this amounts to judging whether the rating is high or low, etc.

Precision, Recall, F-measure, ROC

\text{Precision} = \frac{\text{Number of relevant items in the recommendation}}{\text{Total number of items in the recommendation}}

\text{Recall} = \frac{\text{Number of relevant items in the recommendation}}{\text{Total number of relevant items}}
  • For relevant item discovery tasks where finding one relevant item is sufficient, precision is important
    • Measures how few recommendation errors there are
    • The proportion of actually relevant items among the top n recommended items
  • For relevant item enumeration tasks where all relevant items need to be found, recall is emphasized
    • Measures how few recommendation omissions there are
    • The proportion of relevant items that are included in the top n recommendations
  • For comprehensive judgment, use E-measure, F-measure, or ROC
F = \frac{1}{\frac{\alpha}{\text{Precision}}+\frac{1-\alpha}{\text{Recall}}}

When alpha = 0.5, this is the F1 score
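A sketch of these set-based metrics, assuming the recommendation is a list of item ids and the relevant items form a set (names are my own):

```python
def precision_recall_f(recommended, relevant, alpha=0.5):
    # recommended: list of recommended item ids; relevant: set of relevant ids
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    # Weighted harmonic mean; alpha = 0.5 gives the usual F1 score
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f
```

(For brevity the sketch does not guard against zero hits, which would divide by zero.)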

n-points Interpolated Precision

  • Takes the average of Precision at N recall points
  • Typically, N=11 (recall 0.0, 0.1, 0.2, ..., 0.9, 1.0)
\text{nAIP} = \frac{1}{N}\sum_{i}\text{InterPrecision}_{i}

ROC AUC

\text{AUC}(f) = \frac{\sum_{t_{0} \in D^{0}} \sum_{t_{1} \in D^{1}} 1 \{ f(t_{0}) < f(t_{1}) \}}{|D^{0}| |D^{1}|}

where

  • 1\{ f(t_{0}) < f(t_{1}) \} takes 1 if f(t_{0}) < f(t_{1}), and 0 otherwise
  • D^{0} is the set of negative samples
  • D^{1} is the set of positive samples
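The pairwise definition above translates directly into code (a sketch for illustration; it enumerates all O(|D^0||D^1|) pairs, so it is not how production libraries compute AUC):

```python
def roc_auc(scores_neg, scores_pos):
    # Fraction of (negative, positive) pairs where the positive is scored higher
    pairs = [(n, p) for n in scores_neg for p in scores_pos]
    return sum(1 for n, p in pairs if n < p) / len(pairs)
```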

Hit-rate

  • A metric for unary data (also applicable to binary data)
  1. Search for non-zero entries in the user's test data
  2. Check if the item is included in the recommendation list
  3. Calculate the hit-rate
\text{hit-rate} = \frac{\text{num}_{\text{hits}}}{n}

\text{num}_{\text{hits}}: Number of test items that appear in the recommendation list

n: Total number of test items across all users
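The three steps above can be sketched as follows (per-user lists of recommendations and held-out test items; names are my own):

```python
def hit_rate(recommendations, test_items):
    # recommendations: per-user recommended lists; test_items: per-user held-out items
    hits = sum(1 for recs, held in zip(recommendations, test_items)
               for item in held if item in recs)
    total = sum(len(held) for held in test_items)
    return hits / total
```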

List Relevance Metrics

  • Relatively older ranking metrics

  • Ordering is considered, but the rank position itself is often not considered

Rank Correlation

Spearman's Rank Correlation

r_{\text{Spearman}} = 1 - \frac{6}{n(n^{2} - 1)} \sum^{n}_{i=1} (x_{i} - y_{i})^{2}

x_{i}: Rank of item i as ranked by the recommendation system

y_{i}: Rank of item i as ranked by the user

The coefficient \frac{6}{n(n^{2} - 1)} can be derived by applying Pearson's correlation coefficient to the ranks.

  • Equals 1 when the two rankings agree perfectly
  • Only the ranks matter; the magnitudes of the underlying ratings do not
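A direct implementation of the formula, assuming tie-free rank vectors (a sketch; names are my own):

```python
def spearman(rank_x, rank_y):
    # rank_x[i], rank_y[i]: ranks assigned to item i by the system and the user (no ties)
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```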

Kendall's Rank Correlation

  • Checks agreement between users regarding the magnitude relationship between two items

  • Total number of pairs is n(n-1)/2

  • If "User S's rating of A < User S's rating of B" AND "User T's rating of A < User T's rating of B", then P += 1

  • If "User S's rating of A > User S's rating of B" AND "User T's rating of A > User T's rating of B", then P += 1

  • Otherwise, Q += 1, and the following is calculated:

\tau = \frac{P - Q}{\frac{1}{2} n (n - 1)} = \frac{2P}{\frac{1}{2} n (n - 1)} - 1
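The pair-counting procedure above can be sketched as follows (tied pairs fall into Q here, matching the text's "otherwise" rule; names are my own):

```python
from itertools import combinations

def kendall_tau(ratings_s, ratings_t):
    # Compare the order of every item pair under user S's and user T's ratings
    p = q = 0
    for i, j in combinations(range(len(ratings_s)), 2):
        if (ratings_s[i] - ratings_s[j]) * (ratings_t[i] - ratings_t[j]) > 0:
            p += 1  # pair ordered the same way by both users (concordant)
        else:
            q += 1  # discordant (or tied) pair
    n = len(ratings_s)
    return (p - q) / (n * (n - 1) / 2)
```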

NDPM (Normalized Distance-based Performance Measure)

  • C^-: Number of contradictory preference relationships
    • Cases where the user rates item A > item B, but the recommendation ranks item B > item A
    • The D-E pair in the table below
  • C^u: Number of compatible preference relationships
    • Pairs for which the user states a preference but the recommendation ranks the two items as tied
  • C^i: Total number of preference relationships for item pairs in the user's ranking
\text{NDPM} = \frac{2C^{-}+C^{u}}{2C^{i}}

Rank | Recommend | User
1 | A | A
2 | B | B
3 | C | C
4 | D | E
5 | E | D

\text{NDPM for the table above} = \frac{2 \cdot 1 + 0}{2 \cdot 10} = \frac{1}{10} = 0.1

  • 0 indicates perfect agreement and 1 a completely reversed ranking (lower is better)

Evaluation Metrics for Ranked Recommendations

  • Evaluates how well the recommended items are ordered

  • Whether the N most relevant items are predicted in the correct order

  • When the goal is relevant item discovery, the recommendation system is expected to support decision-making

    • In such cases, the relative order of the predicted ratings matters more than their exact values
  • It is difficult to obtain a complete ranking from the active user

  • The main issue here is the weighting of the ordering

AP (Average Precision)

  • In a ranked recommendation, precision is calculated each time a relevant item is found, and the arithmetic mean is taken
AP = \frac{1}{R}\sum^{n}_{r=1}\text{Prec}(r)I(r)
  • Here, Prec(r) is the precision at rank r, I(r) returns 1 if the item at rank r is relevant (0 otherwise), and R is the total number of relevant items
  • However, this is simply a binary relevance-based evaluation
  • In practice, evaluation metrics based on partial or multi-valued relevance are used
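The binary-relevance AP above can be sketched as follows (the input is the I(r) sequence down the list; names are my own):

```python
def average_precision(ranked_relevance, num_relevant):
    # ranked_relevance: 1/0 flags down the recommendation list (I(r) in the text)
    ap, hits = 0.0, 0
    for r, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / r  # Prec(r), computed each time a relevant item is found
    return ap / num_relevant
```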

MAP (Mean Average Precision)

  • For MAP@k, the AP of each user is computed on the top k items and then averaged across users
\text{MAP} = \frac{1}{N}\sum^{N}_{i=1}\text{AP}_{i}

MRR (Mean Reciprocal Rank)

  • The average of the reciprocal rank of the first relevant item found
  1. Create recommendation lists for all target users
  2. Find at what position the first relevant item appears
  3. Take the reciprocal of this rank (if the list contains no correct answer, use 0 instead of taking the reciprocal)
  4. Average across all users
\text{MRR} = \frac{1}{N}\sum^{N}_{i=1}\frac{1}{\text{rank of the first relevant item for user } i}
  • Differences in rank at lower positions have little impact on the metric
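The four steps above can be sketched as follows (per-user recommendation lists and relevant-item sets; names are my own):

```python
def mrr(recommendation_lists, relevant_sets):
    # Reciprocal rank of the first relevant item per user; 0 if there is none
    total = 0.0
    for recs, relevant in zip(recommendation_lists, relevant_sets):
        rr = 0.0
        for rank, item in enumerate(recs, start=1):
            if item in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(recommendation_lists)
```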

CG (Cumulative Gain, without discounting)

  • The prototype of DCG
  • Sums up the gain (score) of correctly identified items
CG = \sum^{n}_{r=1}g(r)

nDCG (Multi-valued Relevance-based Evaluation Metric)

  • When accumulating the gains of relevant items, items ranked lower are discounted more heavily

    • Each gain is multiplied by a coefficient that is larger for higher ranks and smaller for lower ranks, then summed
    • The discounted value g(r)/\log_{2}(r+1) is called the discounted gain
  • A normalized value of the cumulative discounted gain, where gain represents value based on relevance, and the value decreases as rank gets lower

  • DCG (Discounted Cumulative Gain) is defined below

  • g(r) is the gain at rank r

    • Uses the user's actual evaluation values
    • Can also be calculated with arbitrary weighting such as 5 points for 1st, 4 points for 2nd, 3 points for 3rd, 2 points for 4th, etc.
DCG = \sum^{n}_{r=1}\frac{g(r)}{\log_{2}(r+1)}
  • IDCG is the DCG of an ideal (perfectly ordered) recommendation

  • The normalized nDCG is as follows:

nDCG = \frac{DCG}{IDCG}
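DCG and nDCG can be sketched directly from the formulas (the input is the list of gains g(r) in recommended order; names are my own):

```python
import math

def dcg(gains):
    # Discount each gain by log2(rank + 1)
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))

def ndcg(gains):
    # Normalize by the DCG of the ideal (descending-gain) ordering
    return dcg(gains) / dcg(sorted(gains, reverse=True))
```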

ARHR (Average Reciprocal Hit Rank)

  • For binary data
\text{ARHR} = \frac{1}{n}\sum_{i \in Z_{l}}\frac{1}{\text{pos}(i)}

Z_{l}: Non-zero items in the user's recommendation list

pos: Position of the item in the ranking

Half-life Utility Metric

  • Reduces the importance of items by order using a half-life parameter
R_{u} = \sum_{j}\frac{\max(r_{u,j}-\text{default}, 0)}{2^{(j-1)/(\alpha -1)}}

r_{u,j}: The user u's actual rating of the item at rank j

default: Default evaluation value (usually the mean of evaluation values)

α\alpha: Parameter controlling the half-life

RBP (Rank Biased Precision)

  • Incorporates a user model
\text{RBP} = (1-p)\sum^{n}_{i=1}g_{i}p^{i-1}

g_{i}: Degree of relevance of the item at rank i to the user's preference

p: Parameter modeling user persistence while browsing the ranked list (usually estimated from click logs)
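The formula is a one-liner in code (a sketch; the default p = 0.8 is an arbitrary illustrative choice, not from the source):

```python
def rbp(gains, p=0.8):
    # (1 - p) * sum of g_i * p^(i-1); p models the chance the user keeps browsing
    return (1 - p) * sum(g * p ** (i - 1) for i, g in enumerate(gains, start=1))
```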

ERR (Expected Reciprocal Rank)

  • Cascade user model
    • A model that considers the relevance of items higher in the list
    • Users typically stop searching once satisfied
    • In this case, everything below the stopping point is not clicked
  • One of the cascade models
  • Sums the probability that the user stops browsing the recommended ranking at position r
  • This probability is influenced by higher-ranked positions
\begin{aligned} \text{ERR} &= \sum^{n}_{r=1} \zeta(r) P(\text{user stops at position } r) \\ \zeta(r) &= \frac{1}{r} \quad \text{or} \quad \zeta(r) = \frac{1}{\log_{2}(r+1)} \end{aligned}

Or alternatively:

\begin{aligned} \text{ERR} &= \sum^{n}_{r=1} \zeta(r) \prod^{r-1}_{i=1} (1 - R_{i}) R_{r} \\ R_{i} &= \frac{2^{g_{i}} - 1}{2^{g_{\max}}} \end{aligned}

g_{i}: Degree of relevance of the item at rank i to the user's preference

g_{\max}: Maximum grade on the scale (e.g., 5 on a 1-5 rating scale)
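The cascade form of ERR (with zeta(r) = 1/r) can be sketched as follows (names are my own):

```python
def err(grades, g_max):
    # Cascade model: R_i = (2^g_i - 1) / 2^g_max is the stop probability at rank i
    total, p_not_stopped = 0.0, 1.0
    for r, g in enumerate(grades, start=1):
        r_i = (2 ** g - 1) / (2 ** g_max)
        total += p_not_stopped * r_i / r  # zeta(r) = 1/r
        p_not_stopped *= (1 - r_i)       # user continues past rank r
    return total
```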

Q-measure, GAP (Graded Average Precision)

  • An extension of average precision

Non-accuracy Evaluation Metrics

  • Metrics that are useful to users but not exploration-oriented
    • Coverage, learning rate, confidence, reliability
  • Exploration-oriented metrics useful to users
    • Measuring whether recommendations are novel to the user
    • Serendipity metrics
    • Novelty metrics
    • Diversity metrics

Useful Non-exploration Metrics

Coverage

  • Evaluates how well the recommendation content covers users and items
    • In collaborative filtering, predictions cannot be made for items that have not been rated by users
    • In content-based filtering, items with missing features cannot be predicted
  • There is a trade-off with prediction accuracy
    • Recommending items that most customers purchase (such as milk and eggs) increases prediction accuracy but reduces coverage

Catalog Coverage

  • How many items can be recommended in a single recommendation out of all available items
  • Indicates the breadth of recommended items
\begin{aligned} \text{Catalogue Coverage} &= \frac{|S_{r}|}{|S_{a}|} \\ \text{where} \ \ S_{a}&\text{: all items} \\ S_{r}&\text{: items that can be recommended} \end{aligned}
  • Relevant catalog coverage, which limits the scope of recommendations to relevant items only:
\text{Weighted Catalogue Coverage} = \frac{|S_{r} \cap S_{s}|}{|S_{a}|} \quad (S_{s}\text{: set of relevant items})

User Coverage

  • Indicates how many users out of all target users received recommendations in the most recent recommendation cycle
\begin{aligned} \text{Prediction Coverage} &= \frac{|S_{p}|}{|S_{u}|} \\ \text{where} \ \ S_{p}&\text{: users for whom the system could recommend at least one item} \\ S_{u}&\text{: all users} \end{aligned}
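Both coverage ratios are simple set computations (a sketch; names are my own):

```python
def catalog_coverage(recommendable_items, all_items):
    # Fraction of the catalog that the system is able to recommend
    return len(set(recommendable_items)) / len(set(all_items))

def user_coverage(served_users, all_users):
    # Fraction of users who received at least one recommendation this cycle
    return len(set(served_users)) / len(set(all_users))
```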

Learning Rate

  • How quickly the recommendation system can provide appropriate recommendations after a user's preferences change
    • Related to the cold start problem
    • If the system cannot quickly provide appropriate recommendations, user satisfaction decreases
    • Time elapsed after preference change
    • It is necessary to continuously monitor accuracy from the time a user starts receiving recommendations

Confidence

  • How confident the system is in its recommendations
    • Calculate how many users and items were used to build the recommendation system
      • Data coverage rate at system construction time
    • Calculate the similarity of K-NN used in collaborative filtering predictions
      • When similarity is low, the K-NN distance becomes large, so this should be monitored

Reliability

  • User trust in the recommendation system
  • The recommendation system must demonstrate the ability to predict user preferences and interests
    • Difficult to gain trust if items familiar to the user are not shown
    • If users cannot trust the recommendation system, they will stop using it
    • Trust increases when recommendation results are good
    • Directly ask users through online evaluation
    • Estimate through user usage frequency

Exploration-oriented Metrics

  • If only similar items are recommended, users will get bored with the recommendations

  • Recommending already-known items does not help with decision-making

  • Evaluation targets: user sets, item sets, item pairs, etc.

  • Required data: rating matrix (users x items), ontology (item categories), other novelty/serendipity-related datasets

Cautions

  • Recommending only novel items reduces user satisfaction
    • Users generally prefer known items
    • When evaluated comprehensively, novelty decreases user satisfaction
    • Caution is needed when recommending novel items
  • Consider the user's experience with the recommendation system
    • Explain the novelty of the recommendation system in advance
    • Explain each recommended novel item

Serendipity

  • Whether the recommendation surprises the user
    • The user could not have found the item by searching on their own
    • Surprise when a specific item is recommended
    • Often quantified by comparison against a simpler baseline (primitive) system

Unexpectedness

\begin{aligned} \text{EXEXP} &= \frac{1}{|L_{i}|} \sum^{|L_i|}_{i=1} \max(P(s_i) - \text{prim}(s_i), 0) \cdot \text{rel}(s_i) \\ \text{where} \ \ s_i&\text{: the i-th item} \\ L_i&\text{: the evaluated recommendation list} \\ \text{rel}(s_i)&\text{: user's rating of item }s_i \\ P(s_i)&\text{: predicted rating by the target system} \\ \text{prim}(s_i)&\text{: predicted rating by the primitive system} \end{aligned}
\begin{aligned} \text{UNEXP} &= \text{RS} \setminus \text{PM} \\ \text{SRDP} &= \frac{|\text{UNEXP} \cap \text{USEFUL}|}{N} \end{aligned}

where

\begin{aligned} \text{RS}&\text{: set of items by the recommendation system} \\ \text{PM}&\text{: set of items by the primitive system} \\ \text{UNEXP}&\text{: set of unexpected items} \\ \text{USEFUL}&\text{: set of useful items} \end{aligned}

Entropy-based Diversity

  • Whether items recommended by one system are also recommended by other recommendation systems
\begin{aligned} \text{div}_{a,u} &= -\sum_{i \in L_{a,u} \cap R_{u}} p_{u,i} \log p_{u,i} \\ p_{u,i} &= \frac{\sum_{a \in A} \delta(a,u,i)}{|A|} \\ \text{where} \ \ A&\text{: set of recommendation systems} \\ a&\text{: the target system} \\ L_{a,u}&\text{: recommendation list provided by system }a\text{ for user }u \\ R_u&\text{: items relevant to user }u \\ \delta&\text{: } 1 \text{ if } i \in L_{a,u} \cap R_u \text{ and } 0 \text{ otherwise} \end{aligned}

Novelty

  • Whether the recommended item is unknown to the user

  • Discovery ratio, Precision of novelty, Item novelty, Temporal novelty, Novelty based on HLU, Long tail metric, Generalized novelty model, Unexpectedness, Entropy-based diversity, Unserendipity, HLU of serendipity

How to Measure

  • Ask about the user's relationship with a specific item

  • Calculate general popularity or similarity between items

  • In the user x item rating matrix, values become 0 or 1 (known or unknown)

Discovery Ratio

  • How many items in the recommendation list are unknown to the user
\begin{aligned} \text{Discovery} &= \frac{|D_{i} \cap L_{i}|}{|L_{i}|} \\ \text{where} \ \ D_{i}&\text{: list of items unknown to user }i \\ L_{i}&\text{: recommendation list for user }i \end{aligned}

Precision of Novelty

\begin{aligned} \text{Precision(novelty)} &= \frac{|C_{i} \cap L_{i}|}{|L_{i}|} \\ \text{Recall(novelty)} &= \frac{|C_{i} \cap L_{i}|}{|C_{i}|} \\ \text{where} \ \ C_{i}&\text{: list of items unknown to and liked by user }i \\ L_{i}&\text{: recommendation list for user }i \end{aligned}

Item Novelty

  • Measures the novelty of recommended items
    • Uses diversity within the list
\begin{aligned} \text{Nov}_{\text{item}}(i) &= N(\text{Div}(L) - \text{Div}(L - \{i\})) \\ &= \frac{1}{N-1} \sum_{j \in L} d(i,j) \\ \text{where} \ \ |L| &= N \\ d(i,j)&\text{: function measuring distance between items} \\ \text{Div}()&\text{: function measuring diversity of the recommendation list} \end{aligned}

Temporal Novelty

  • Measures whether recommendations differ from past recommendation lists
\begin{aligned} \text{Nov}_{\text{tmp}}(L_{i}) &= \frac{| \{ x \in L_{i} \mid x \notin S_{\text{past}} \} |}{|L_{i}|} \\ \text{where} \ \ S_{\text{past}} &= \bigcup^{i-1}_{j=1} L_{j} \end{aligned}

Half-life Based Novelty

  • Discounts the gain of an item's rating according to its rank, using a half-life parameter
    • Items that are recommended but not purchased have their contribution discounted
  • Half-life evaluation function
\begin{aligned} R_{u} &= \sum_{j} \frac{\max(r_{u,j} - \text{default}, 0)}{2^{(j-1)/(\alpha-1)}} \\ \text{where} \ \ r_{u,j}&\text{: user }u\text{'s rating of the item at rank }j \\ \text{default}&\text{: default evaluation value (usually the mean)} \\ \alpha&\text{: half-life parameter} \end{aligned}
  • NHLU
\begin{aligned} f(i) &= \log_{2}\left(\frac{n}{n_{i}}\right) \\ \text{NHLU}_{u} &= \sum^{N}_{i} f(i) \frac{\max(r_{u,i} - \text{default}, 0)}{2^{(i-1)/(\alpha-1)}} \\ \text{where} \ \ n&\text{: number of users} \\ n_i&\text{: number of users who rated item }i \end{aligned}

Long-tail Based Novelty

  • Categorize items into three categories (HEAD, MID, TAIL) based on popularity
  • Extract item pairs (a_{i}, a_{j}) from the top N recommendations
  • Evaluate the ability to recommend novel items using the following evaluation matrix
a_i → a_j | HEAD | MID | TAIL
HEAD | 45% | 55% | 0%
MID | 5% | 70% | 25%
TAIL | 0% | 15% | 85%
  • If HEAD→HEAD is large, novelty is low
  • An indicator of whether the system merely rotates popular items

Diversity

  • Whether various types of items can be recommended to users

  • Aggregate diversity, Inter-user diversity, List personalization metric, Gini coefficient, Temporal diversity, Intra-list similarity, Subtopic retrieval, MMR, alpha-nDCG

Aggregate Diversity

  • Diversity of the recommendation system itself

  • Aggregates the types of items recommended to all users

  • A high value indicates that the recommendation system provides different items to users

\text{Aggdiv} = \left| \bigcup_{u \in U} L_{u} \right|

where

\begin{aligned} U&\text{: set of users} \\ L_{u}&\text{: list of items recommended to user }u \end{aligned}
  • Some research uses the total item set as the denominator to measure how many items out of all items were recommended to users
  • By replacing item types with subcategories/subtopics, category/topic diversity can also be measured (Subtopic retrieval metric)

Inter-user Diversity

  • Diversity across users
  • A mechanism for providing different recommendations to different users
  • How well the recommendation system is personalized for individual users (IUD)
\begin{aligned} d_{u,v} &= 1 - \frac{|L_{u} \cap L_{v}|}{N} \\ \text{IUD} &= \frac{1}{N_{U}} \sum_{u,v \in U} d_{u,v} \\ \text{where} \ \ U&\text{: user set} \\ N &= |L_{u}| = |L_{v}| \\ N_{U}&\text{: number of pairs in }U \\ L_{u}&\text{: list of items recommended to user }u \end{aligned}

List Personalization Metric

  • Whether the list is optimized for individuals
\begin{aligned} p_{b} &= \frac{U_{b}}{|U|} \\ I_{b} &= \log_{2}\left(\frac{|U|}{U_{b}}\right) \\ \text{Per}(L_{i}) &= \frac{\sum_{b_{j} \in L_{i}} \log_{2} \frac{|U|}{U_{b_{j}}}}{|L_{i}|} \\ \text{where} \ \ p_{b}&\text{: probability that item }b\text{ is selected by a user} \\ I_{b}&\text{: self-entropy of item }b \\ \text{Per}(L_{i})&\text{: degree of personalization of the list} \\ U_{b}&\text{: number of users who selected item }b \end{aligned}

Gini Coefficient

  • Used as an indicator of income inequality, etc.
  • How biased the recommendation frequency of items is
  • x-axis: Percentage x% of items arranged in order of total frequency in the recommendation list
  • y-axis: Total frequency of x% of items in the list
\begin{aligned} G &= \frac{A}{A + B} \\ \text{where} \ \ A&\text{: area between the 45-degree line and the curve when graphed} \\ B&\text{: remaining area (area under the curve)} \end{aligned}
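On discrete per-item recommendation counts, the Gini coefficient can be computed with the standard closed form instead of measuring areas (a sketch; names are my own):

```python
def gini(frequencies):
    # Gini coefficient of per-item recommendation counts
    # (0 = perfectly uniform, approaching 1 = concentrated on few items)
    xs = sorted(frequencies)
    n, total = len(xs), sum(xs)
    # Standard discrete formula, equivalent to 1 - 2 * (area under the Lorenz curve)
    cum = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return cum / (n * total)
```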

Temporal Diversity

  • Measurement when the system outputs different items at different times
\text{div}_{\text{tmp}}(L_{1}, L_{2}, N) = \frac{| \{ x \in L_{2} \mid x \notin L_{1} \} |}{N}

where N = |L_{1}| = |L_{2}|

Intra-list Diversity

  • Diversity based on the content of the recommendation list
  • Calculated using the similarity between two items
    • Typically, feature vectors or categories are used
\text{Diversity} = \sum_{l \in L_{i}, m \in L_{i}} \frac{1}{\text{similarity}(l,m)}

where L_{i}: list of items recommended to user i

\text{ILS}(L_{i}) = \frac{\sum_{b_{k} \in L_{i}} \sum_{b_{e} \in L_{i}, b_{k} \neq b_{e}} \text{similarity}(b_{k}, b_{e})}{\binom{|L_{i}|}{2}}
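The ILS formula sums over ordered distinct pairs and divides by C(|L|, 2), so for a symmetric similarity it equals twice the mean pairwise similarity (a sketch; the similarity function is supplied by the caller):

```python
from itertools import permutations
from math import comb

def intra_list_similarity(items, similarity):
    # Ordered double sum over distinct pairs, divided by C(|L|, 2)
    total = sum(similarity(a, b) for a, b in permutations(items, 2))
    return total / comb(len(items), 2)
```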

MMR (Maximal Marginal Relevance)

  • Originally a document selection method for search results
  • This concept is applied to evaluate recommendation result lists
\text{MMR}_{\text{List}} = \sum_{d_{i} \in L} \left( \alpha \cdot \text{rel}(d_{i}) - (1 - \alpha) \max_{d_{j} \in L^{i-1}} \text{sim}(d_{i}, d_{j}) \right)

where

\begin{aligned} L&\text{: recommendation list} \\ d_{i}&\text{: item in the list} \\ \text{rel}&\text{: whether the user likes the item} \\ L^{i-1}&\text{: recommendation list up to the }(i-1)\text{-th item} \end{aligned}

Notes on Recommendation System Evaluation Metrics

  • AP and nDCG can evaluate ranked recommendations for specific users
  • When evaluating a system, it is necessary to evaluate the average quality across multiple users

MAP

  • The arithmetic mean of AP over a set of N users
\text{MAP} = \frac{1}{N} \sum^{N}_{i=1} \text{AP}_{i}
  • MAP alone is insufficient when the goal is to present reasonably relevant recommendations to as many users as possible

GMAP (Geometric Mean AP)

\text{GMAP} = \sqrt[N]{\prod^{N}_{i=1} \text{AP}_{i}}

For convenience, this can be transformed into the following form, where \epsilon is a very small value (around 10^{-5}):

\text{GMAP} = \exp\left(\frac{1}{N} \sum^{N}_{i=1} \ln(\text{AP}_{i} + \epsilon)\right) - \epsilon

When recommendations for some users contain no relevant items, GMAP evaluates lower than MAP.

  • When the measurement range is short (e.g., top 10), MAP and GMAP values tend to be unstable
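The log-space form above can be sketched as follows; note how a single user with AP = 0 pulls GMAP far below the arithmetic mean:

```python
import math

def gmap(ap_values, eps=1e-5):
    # Geometric mean of AP in log space, with eps guarding against AP = 0
    n = len(ap_values)
    return math.exp(sum(math.log(ap + eps) for ap in ap_values) / n) - eps
```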

infAP (Inferred Average Precision)

  • Estimates the true average precision from incomplete test collections

Policy-based Evaluation Metrics

  • Distribution of specific item attributes
    • When there are categories to promote, evaluation can be based on the distribution of specific attributes of the provided items
  • Frequency of prohibited (NG) item appearances
    • The goal is essentially 0
  • Transition rate to target items/categories
    • Measured when controlling where users navigate to
  • Gini coefficient of page views per item
    • Measured to see if bias is corrected when introducing mechanisms to prevent item view bias by the system

Interleaving

Industry Status in 2017 (RecSys 2017 Conference)


  • At the recommendation systems conference, the metrics used by companies as of 2017, in order of frequency:
  1. MRR (ranked recommendation list metric considering ordering, weighted by reciprocal of rank)
  2. AUC (classification problem)
  3. Precision (finding one relevant item)
  4. nDCG (ranked recommendation list metric considering ordering, with adjustable weights)
  5. Recall (enumerating all relevant items)
  • Although exploration-oriented evaluation metrics have been proposed in papers, in real environments the answer data is difficult to obtain, and they are not widely used in practice yet.
  • MRR and nDCG are relatively newer methods but are widely used.
