Similarity Metrics Between Samples

A summary of research on evaluating similarity between entities in recommendation systems.

  • Metrics for measuring the similarity of items and users
  • Primarily used in collaborative filtering and K-NN to find "similar" users/items

Classic Metrics

Pearson Correlation

  • The standard correlation coefficient
  • Takes values from -1 to +1
$$
\begin{aligned}
\rho &= \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \\
&= \frac{E\bigl[(X-E[X])(Y-E[Y])\bigr]}{\sqrt{E[(X-E[X])^2]}\sqrt{E[(Y-E[Y])^2]}}
\end{aligned}
$$
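As a quick illustration, a minimal pure-Python version of the coefficient (the function name `pearson` and the toy rating lists are only for this sketch):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

# perfectly linearly related ratings -> correlation 1.0
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 1.0
```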

Cosine

  • The cosine of the angle between vector A and vector B
  • Takes values from -1 to 1
$$
cosineSim = \frac{\sum^n_{i=1} a_i b_i}{\sqrt{\sum^n_{i=1} a_i^2}\sqrt{\sum^n_{i=1} b_i^2}}
$$
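A minimal sketch of the same formula (function name and toy vectors are illustrative only):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(round(cosine_sim([1, 0], [0, 1]), 4))        # 0.0 (orthogonal vectors)
print(round(cosine_sim([2, 2, 1], [4, 4, 2]), 4))  # 1.0 (same direction)
```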

Adjusted Cosine

  • The fundamental difference between user-based CF (collaborative filtering) and item-based CF similarity calculation is whether the computation is performed along rows or columns
  • In item-based CF, computing similarity with plain cosine has one critical flaw:
    • Cosine reflects only the angle between the vectors, not their position, so differences in users' rating scales are ignored
  • Adjusted cosine compensates for this drawback by subtracting each user's mean rating from that user's ratings in the co-rated pairs
$$
\begin{aligned}
sim(i,j) &= \frac{\sum_{u \in U}(R_{u,i}-\bar R_u)(R_{u,j}-\bar R_u)}{\sqrt{\sum_{u \in U}(R_{u,i}-\bar R_u)^2}\sqrt{\sum_{u \in U}(R_{u,j}-\bar R_u)^2}} \\
\text{where} \ \ \bar R_u &: \text{mean rating of user } u
\end{aligned}
$$
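A sketch over a toy user-item matrix (the `ratings` dict, user ids, and item ids are made up for this example; missing entries simply do not appear in the inner dicts):

```python
import math

# toy user-item ratings: user -> {item: rating}
ratings = {
    "u1": {"i": 4, "j": 5, "k": 1},
    "u2": {"i": 3, "j": 4},
    "u3": {"i": 1, "j": 2, "k": 5},
}

def adjusted_cosine(i, j):
    """Adjusted cosine between items i and j: each user's mean is subtracted first."""
    num = den_i = den_j = 0.0
    for user, r in ratings.items():
        if i in r and j in r:  # only users who co-rated both items
            mean = sum(r.values()) / len(r)  # that user's mean rating
            di, dj = r[i] - mean, r[j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    den = math.sqrt(den_i) * math.sqrt(den_j)
    return num / den if den else 0.0

print(round(adjusted_cosine("i", "j"), 4))  # 0.568
```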

Euclidean

  • Euclidean distance

  • Typically normalized for use as a similarity measure

$$
sim = \frac{1}{1+\sqrt{\sum^n_{i=1}(p_i-q_i)^2}}
$$
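The normalization above maps distance 0 to similarity 1 and large distances toward 0; a minimal sketch:

```python
import math

def euclidean_sim(p, q):
    """Similarity in (0, 1]: 1 when the vectors are identical."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return 1 / (1 + dist)

print(euclidean_sim([1, 2], [1, 2]))             # 1.0
print(round(euclidean_sim([0, 0], [3, 4]), 4))   # 0.1667 (distance 5 -> 1/6)
```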

Jaccard Similarity

  • Considers only which items were rated, not the rating values themselves
  • The ratio of the number of items co-rated by both users to the number of items rated by either user
$$
Sim(u,v)^{Jaccard} = \frac{|I_u \cap I_v|}{|I_u \cup I_v|}
$$
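A minimal sketch on the rated-item sets (the item ids are illustrative only):

```python
def jaccard(items_u, items_v):
    """Jaccard similarity between two users' rated-item sets."""
    u, v = set(items_u), set(items_v)
    union = len(u | v)
    return len(u & v) / union if union else 0.0

# 2 co-rated items out of 4 distinct items overall
print(jaccard(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```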

Mean Square Distance

  • Mean square distance (MSD) is the mean of the squared differences in ratings over the items co-rated by both users
  • Mean square similarity is obtained by subtracting the MSD from 1 (ratings are assumed scaled so the MSD lies in [0, 1])
$$
Sim(u,v)^{MSD} = 1 - \frac{\sum_{i\in I(u,v)}(R(u,i)-R(v,i))^2}{|I(u,v)|}
$$
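A sketch, assuming ratings have already been scaled to [0, 1] so the similarity stays in range (the dict-based representation is illustrative only):

```python
def msd_sim(ru, rv):
    """1 - mean squared rating difference over co-rated items.
    ru, rv map item id -> rating; ratings are assumed scaled to
    [0, 1] so the result stays in [0, 1] (assumption of this sketch)."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    msd = sum((ru[i] - rv[i]) ** 2 for i in common) / len(common)
    return 1 - msd

print(msd_sim({"a": 1.0, "b": 0.5}, {"a": 1.0, "b": 0.5}))  # 1.0 (identical)
print(msd_sim({"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}))  # 0.0 (maximally different)
```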

Newer Metrics

JMSD

  • Combines numerical information from MSD and non-numerical information from Jaccard
  • Computes partial similarities from Jaccard and MSD, and the final similarity is generated by multiplying these two similarity measures
$$
\begin{aligned}
Sim(u,v)^{JMSD} &= Sim(u,v)^{Jaccard} \cdot Sim(u,v)^{MSD} \\
&= \frac{|I_u \cap I_v|}{|I_u \cup I_v|} \left( 1 - \frac{\sum_{i\in I(u,v)}(R(u,i)-R(v,i))^2}{|I(u,v)|} \right)
\end{aligned}
$$
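Combining the two factors in one function, a sketch (dict keys are item ids; ratings are assumed scaled to [0, 1] so the MSD term stays in range):

```python
def jmsd(ru, rv):
    """Jaccard * MSD-similarity; ru, rv map item id -> rating in [0, 1]."""
    common = set(ru) & set(rv)
    union = set(ru) | set(rv)
    if not common:
        return 0.0
    jac = len(common) / len(union)
    msd = sum((ru[i] - rv[i]) ** 2 for i in common) / len(common)
    return jac * (1 - msd)

# identical ratings on the 2 common items, 4 distinct items overall
ru = {"a": 1.0, "b": 0.5, "c": 0.0}
rv = {"a": 1.0, "b": 0.5, "d": 1.0}
print(round(jmsd(ru, rv), 4))  # 0.5  (Jaccard 0.5 * MSD-sim 1.0)
```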

SING

  • A similarity metric specialized for memory-based collaborative filtering

  • The paper is not publicly available (only the Abstract and References are available)

GEN

  • A similarity metric based on genetic algorithms

  • Aimed at improving the accuracy and results of CF-based recommendation systems

$$
\begin{aligned}
sim_w(x,y) &= \frac{1}{M-m+1}\sum^{M-m}_{i=0} w^{i} v^{i}_{x,y} \\
\text{where} \ \ m &: \text{minimum rating value (usually 1)} \\
M &: \text{maximum rating value (usually 5 or 10)} \\
v &: \text{vector of rating-difference proportions between the two users} \\
w &: \text{similarity coefficients chosen to minimize the MAE of the recommendation system}
\end{aligned}
$$
  • Example

    • Assume rating vectors r_1 = (4, 5, x, 3, 2, x, 1, 1, 4) and r_2 = (4, 3, 1, 2, x, 3, 4, x, 2)
    • The rating differences are r_d = (0, 2, x, 1, x, x, 3, x, 2)
    • Count the proportions of rating differences: v = (1/5, 1/5, 2/5, 1/5, 0) (proportion of difference 0, proportion of difference 1, proportion of difference 2, ...)
    • w takes values from (1, 0.5, 0, -0.5, -1)
      • 1 means very similar, -1 means very dissimilar
      • 0 means that rating difference is estimated to be unrelated to similarity
  • w is learned with a genetic algorithm to minimize the overall system MAE
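The worked example above can be reproduced in a few lines (using `None` for the unrated `x` entries; the `w` values are the illustrative coefficients from the text, not GA-learned ones):

```python
# rating vectors from the example; None marks an unrated item
r1 = [4, 5, None, 3, 2, None, 1, 1, 4]
r2 = [4, 3, 1, 2, None, 3, 4, None, 2]

# absolute rating differences over co-rated items
diffs = [abs(a - b) for a, b in zip(r1, r2) if a is not None and b is not None]
# diffs == [0, 2, 1, 3, 2]

m, M = 1, 5  # rating scale 1..5 -> possible differences 0..4
v = [diffs.count(d) / len(diffs) for d in range(M - m + 1)]
# v == [0.2, 0.2, 0.4, 0.2, 0.0]

w = [1, 0.5, 0, -0.5, -1]  # example coefficients; normally learned by the GA
sim = sum(wi * vi for wi, vi in zip(w, v)) / (M - m + 1)
print(round(sim, 4))  # 0.04
```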

TRUST

  • Derives user trustworthiness from rating data
  • The paper is not publicly available (only the Abstract and References are available)

Problems with Classic Metrics

  • For example, consider rating vectors U=(2, 0, 3, 0) and V=(5, 2, 0, 2), where 0 means unrated
    • The two users co-rated only 1 item
    • Pearson similarity is 0 (on a single co-rated item the deviations from the mean vanish, so the coefficient is undefined and conventionally set to 0)
    • Cosine similarity over that single co-rated item is 1, even though the ratings (2 vs. 5) disagree
  • U=(2, 1, 3, 2), V=(1, 2, 2, 3) appear similar, but Pearson similarity is 0
  • U=(2, 2, 0, 1), V=(4, 4, 0, 2) yields cosine similarity of 1
    • Cosine measures only direction, so V rating every item twice as high as U still counts as perfectly similar
    • The absolute rating levels, which carry real preference information, are ignored
  • U=(5, 5, 1, 1), V=(1, 1, 5, 5) yields a Jaccard similarity of 1, even though the preferences are exactly opposite
    • Ignoring rating values and looking only at whether items were co-rated discards a large amount of information
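These pathologies are easy to reproduce; a small demo, assuming Pearson and cosine are computed over the full vectors and Jaccard over the rated-item sets:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

# similar-looking vectors, yet Pearson is exactly 0
print(round(pearson([2, 1, 3, 2], [1, 2, 2, 3]), 4))  # 0.0

# proportional vectors: cosine calls them identical despite different levels
print(round(cosine([2, 2, 0, 1], [4, 4, 0, 2]), 4))   # 1.0

# opposite preferences, but the rated-item sets coincide -> Jaccard 1.0
u_items, v_items = {0, 1, 2, 3}, {0, 1, 2, 3}
print(len(u_items & v_items) / len(u_items | v_items))  # 1.0
```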