Similarity Metrics Between Samples

A summary of research on evaluating similarity between entities in recommendation systems.

  • Metrics for measuring the similarity of items and users
  • Primarily used in collaborative filtering and K-NN to find "similar" users/items

Classic Metrics

Pearson Correlation

  • The standard correlation coefficient
  • Takes values from -1 to +1
$$
\begin{aligned}
\rho &= \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \\
&= \frac{E\bigl[(X-E[X])(Y-E[Y])\bigr]}{\sqrt{E[(X-E[X])^2]}\sqrt{E[(Y-E[Y])^2]}}
\end{aligned}
$$
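As a quick illustration, a minimal pure-Python version of the coefficient (the function name `pearson` and the toy rating lists are only for this sketch):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

# perfectly linearly related ratings -> correlation 1.0
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 1.0
```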

Cosine

  • The cosine of the angle between vector A and vector B
  • Takes values from -1 to 1
$$
cosineSim = \frac{\sum^n_{i=1} a_i b_i}{\sqrt{\sum^n_{i=1} a_i^2}\sqrt{\sum^n_{i=1} b_i^2}}
$$
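A minimal sketch of the same formula (function name and toy vectors are illustrative only):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(round(cosine_sim([1, 0], [0, 1]), 4))        # 0.0 (orthogonal vectors)
print(round(cosine_sim([2, 2, 1], [4, 4, 2]), 4))  # 1.0 (same direction)
```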

Adjusted Cosine

  • The fundamental difference between user-based CF (collaborative filtering) and item-based CF similarity calculation is whether the computation is performed along rows or columns
  • In item-based CF, computing similarity with plain cosine has one critical flaw:
    • Cosine reflects only the angle between the vectors, not their position, so differences in users' rating scales are ignored
  • Adjusted cosine compensates for this drawback by subtracting each user's mean rating from that user's ratings in the co-rated pairs
$$
\begin{aligned}
sim(i,j) &= \frac{\sum_{u \in U}(R_{u,i}-\bar R_u)(R_{u,j}-\bar R_u)}{\sqrt{\sum_{u \in U}(R_{u,i}-\bar R_u)^2}\sqrt{\sum_{u \in U}(R_{u,j}-\bar R_u)^2}} \\
\text{where} \ \ \bar R_u &: \text{mean rating of user } u
\end{aligned}
$$
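A sketch over a toy user-item matrix (the `ratings` dict, user ids, and item ids are made up for this example; missing entries simply do not appear in the inner dicts):

```python
import math

# toy user-item ratings: user -> {item: rating}
ratings = {
    "u1": {"i": 4, "j": 5, "k": 1},
    "u2": {"i": 3, "j": 4},
    "u3": {"i": 1, "j": 2, "k": 5},
}

def adjusted_cosine(i, j):
    """Adjusted cosine between items i and j: each user's mean is subtracted first."""
    num = den_i = den_j = 0.0
    for user, r in ratings.items():
        if i in r and j in r:  # only users who co-rated both items
            mean = sum(r.values()) / len(r)  # that user's mean rating
            di, dj = r[i] - mean, r[j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    den = math.sqrt(den_i) * math.sqrt(den_j)
    return num / den if den else 0.0

print(round(adjusted_cosine("i", "j"), 4))  # 0.568
```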

Euclidean

  • Euclidean distance

  • Typically normalized for use as a similarity measure

$$
sim = \frac{1}{1+\sqrt{\sum^n_{i=1}(p_i-q_i)^2}}
$$
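The normalization above maps distance 0 to similarity 1 and large distances toward 0; a minimal sketch:

```python
import math

def euclidean_sim(p, q):
    """Similarity in (0, 1]: 1 when the vectors are identical."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return 1 / (1 + dist)

print(euclidean_sim([1, 2], [1, 2]))             # 1.0
print(round(euclidean_sim([0, 0], [3, 4]), 4))   # 0.1667 (distance 5 -> 1/6)
```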

Jaccard Similarity

  • Considers only which items were rated, not the rating values themselves
  • The ratio of the number of items co-rated by both users to the number of items rated by either user
$$
Sim(u,v)^{Jaccard} = \frac{|I_u \cap I_v|}{|I_u \cup I_v|}
$$
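A minimal sketch on the rated-item sets (the item ids are illustrative only):

```python
def jaccard(items_u, items_v):
    """Jaccard similarity between two users' rated-item sets."""
    u, v = set(items_u), set(items_v)
    union = len(u | v)
    return len(u & v) / union if union else 0.0

# 2 co-rated items out of 4 distinct items overall
print(jaccard(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```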

Mean Square Distance

  • Mean square distance (MSD) is the mean of the squared differences in ratings over the items co-rated by both users
  • Mean square similarity is obtained by subtracting the MSD from 1 (ratings are assumed scaled so the MSD lies in [0, 1])
$$
Sim(u,v)^{MSD} = 1 - \frac{\sum_{i\in I(u,v)}(R(u,i)-R(v,i))^2}{|I(u,v)|}
$$
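A sketch, assuming ratings have already been scaled to [0, 1] so the similarity stays in range (the dict-based representation is illustrative only):

```python
def msd_sim(ru, rv):
    """1 - mean squared rating difference over co-rated items.
    ru, rv map item id -> rating; ratings are assumed scaled to
    [0, 1] so the result stays in [0, 1] (assumption of this sketch)."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    msd = sum((ru[i] - rv[i]) ** 2 for i in common) / len(common)
    return 1 - msd

print(msd_sim({"a": 1.0, "b": 0.5}, {"a": 1.0, "b": 0.5}))  # 1.0 (identical)
print(msd_sim({"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}))  # 0.0 (maximally different)
```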

Newer Metrics

JMSD

  • Combines numerical information from MSD and non-numerical information from Jaccard
  • Computes partial similarities from Jaccard and MSD, and the final similarity is generated by multiplying these two similarity measures
$$
\begin{aligned}
Sim(u,v)^{JMSD} &= Sim(u,v)^{Jaccard} \cdot Sim(u,v)^{MSD} \\
&= \frac{|I_u \cap I_v|}{|I_u \cup I_v|} \left( 1 - \frac{\sum_{i\in I(u,v)}(R(u,i)-R(v,i))^2}{|I(u,v)|} \right)
\end{aligned}
$$
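Combining the two factors in one function, a sketch (dict keys are item ids; ratings are assumed scaled to [0, 1] so the MSD term stays in range):

```python
def jmsd(ru, rv):
    """Jaccard * MSD-similarity; ru, rv map item id -> rating in [0, 1]."""
    common = set(ru) & set(rv)
    union = set(ru) | set(rv)
    if not common:
        return 0.0
    jac = len(common) / len(union)
    msd = sum((ru[i] - rv[i]) ** 2 for i in common) / len(common)
    return jac * (1 - msd)

# identical ratings on the 2 common items, 4 distinct items overall
ru = {"a": 1.0, "b": 0.5, "c": 0.0}
rv = {"a": 1.0, "b": 0.5, "d": 1.0}
print(round(jmsd(ru, rv), 4))  # 0.5  (Jaccard 0.5 * MSD-sim 1.0)
```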

SING

  • A similarity metric specialized for memory-based collaborative filtering

  • The paper is not publicly available (only the Abstract and References are available)

GEN

  • A similarity metric based on genetic algorithms

  • Aimed at improving the accuracy and results of CF-based recommendation systems

$$
\begin{aligned}
sim_w(x,y) &= \frac{1}{M-m+1}\sum^{M-m}_{i=0} w^{i} v^{i}_{x,y} \\
\text{where} \ \ m &: \text{minimum rating value (usually 1)} \\
M &: \text{maximum rating value (usually 5 or 10)} \\
v &: \text{vector of rating-difference proportions between the two users} \\
w &: \text{similarity coefficients chosen to minimize the MAE of the recommendation system}
\end{aligned}
$$
  • Example

    • Assume rating vectors r_1 = (4, 5, x, 3, 2, x, 1, 1, 4) and r_2 = (4, 3, 1, 2, x, 3, 4, x, 2)
    • The rating differences are r_d = (0, 2, x, 1, x, x, 3, x, 2)
    • Count the proportions of rating differences: v = (1/5, 1/5, 2/5, 1/5, 0) (proportion of difference 0, proportion of difference 1, proportion of difference 2, ...)
    • w takes values from (1, 0.5, 0, -0.5, -1)
      • 1 means very similar, -1 means very dissimilar
      • 0 means that rating difference is estimated to be unrelated to similarity
  • w is learned with a genetic algorithm to minimize the overall system MAE
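The worked example above can be reproduced in a few lines (using `None` for the unrated `x` entries; the `w` values are the illustrative coefficients from the text, not GA-learned ones):

```python
# rating vectors from the example; None marks an unrated item
r1 = [4, 5, None, 3, 2, None, 1, 1, 4]
r2 = [4, 3, 1, 2, None, 3, 4, None, 2]

# absolute rating differences over co-rated items
diffs = [abs(a - b) for a, b in zip(r1, r2) if a is not None and b is not None]
# diffs == [0, 2, 1, 3, 2]

m, M = 1, 5  # rating scale 1..5 -> possible differences 0..4
v = [diffs.count(d) / len(diffs) for d in range(M - m + 1)]
# v == [0.2, 0.2, 0.4, 0.2, 0.0]

w = [1, 0.5, 0, -0.5, -1]  # example coefficients; normally learned by the GA
sim = sum(wi * vi for wi, vi in zip(w, v)) / (M - m + 1)
print(round(sim, 4))  # 0.04
```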

TRUST

  • Derives user trustworthiness from rating data
  • The paper is not publicly available (only the Abstract and References are available)

Problems with Classic Metrics

  • For example, consider rating vectors U=(2, 0, 3, 0) and V=(5, 2, 0, 2), where 0 means unrated
    • The two users co-rated only 1 item
    • Pearson similarity is 0 (on a single co-rated item the deviations from the mean vanish, so the coefficient is undefined and conventionally set to 0)
    • Cosine similarity over that single co-rated item is 1, even though the ratings (2 vs. 5) disagree
  • U=(2, 1, 3, 2), V=(1, 2, 2, 3) appear similar, but Pearson similarity is 0
  • U=(2, 2, 0, 1), V=(4, 4, 0, 2) yields cosine similarity of 1
    • Cosine measures only direction, so V rating every item twice as high as U still counts as perfectly similar
    • The absolute rating levels, which carry real preference information, are ignored
  • U=(5, 5, 1, 1), V=(1, 1, 5, 5) yields a Jaccard similarity of 1, even though the preferences are exactly opposite
    • Ignoring rating values and looking only at whether items were co-rated discards a large amount of information
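These pathologies are easy to reproduce; a small demo, assuming Pearson and cosine are computed over the full vectors and Jaccard over the rated-item sets:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

# similar-looking vectors, yet Pearson is exactly 0
print(round(pearson([2, 1, 3, 2], [1, 2, 2, 3]), 4))  # 0.0

# proportional vectors: cosine calls them identical despite different levels
print(round(cosine([2, 2, 0, 1], [4, 4, 0, 2]), 4))   # 1.0

# opposite preferences, but the rated-item sets coincide -> Jaccard 1.0
u_items, v_items = {0, 1, 2, 3}, {0, 1, 2, 3}
print(len(u_items & v_items) / len(u_items | v_items))  # 1.0
```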