Similarity Metrics Between Samples
A summary of research on evaluating similarity between entities in recommendation systems.
- Metrics for measuring the similarity of items and users
- Primarily used in collaborative filtering and K-NN to find "similar" users/items
Classic Metrics
Pearson Correlation
- The standard correlation coefficient
- Takes values from -1 to +1
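As a sketch, Pearson correlation restricted to co-rated items might look like the following (treating 0 as "unrated" is an assumed convention here, not something the note specifies):

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation over co-rated items (0 is treated as 'unrated')."""
    pairs = [(a, b) for a, b in zip(u, v) if a and b]
    n = len(pairs)
    if n < 2:
        return 0.0  # undefined with fewer than two co-rated items
    mu = sum(a for a, _ in pairs) / n
    mv = sum(b for _, b in pairs) / n
    num = sum((a - mu) * (b - mv) for a, b in pairs)
    den = (sqrt(sum((a - mu) ** 2 for a, _ in pairs))
           * sqrt(sum((b - mv) ** 2 for _, b in pairs)))
    return num / den if den else 0.0

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linearly related -> approx. 1.0
```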
Cosine
- The cosine of the angle between vector A and vector B
- Takes values from -1 to 1
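A minimal sketch of plain cosine over full vectors:

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between rating vectors u and v."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

print(cosine([1, 0], [0, 1]))  # 0.0: orthogonal vectors
print(cosine([1, 2], [2, 4]))  # approx. 1.0: same direction
```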
Adjusted Cosine
- The fundamental difference between user-based CF (collaborative filtering) and item-based CF similarity calculation is whether the computation is performed along rows or columns
- In item-based CF, using cosine to compute similarity has one critical flaw:
- Cosine reflects only the angle between vectors, not their position, so it ignores each user's overall rating level (a generous rater and a harsh rater with the same preference ordering look identical)
- Adjusted cosine compensates for this drawback by subtracting each user's mean rating from the corresponding co-rated pairs
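A sketch of adjusted cosine for item-based CF, assuming a user x item matrix where 0 means "unrated" (both the matrix layout and the 0-convention are assumptions for this illustration):

```python
from math import sqrt

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine between item columns i and j.

    Each user's mean rating (over the items they rated) is subtracted
    before computing the cosine, which removes per-user rating bias.
    """
    num = den_i = den_j = 0.0
    for row in ratings:
        if row[i] and row[j]:  # user rated both items
            rated = [r for r in row if r]
            mean = sum(rated) / len(rated)
            di, dj = row[i] - mean, row[j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    den = sqrt(den_i) * sqrt(den_j)
    return num / den if den else 0.0

R = [
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
]
print(adjusted_cosine(R, 0, 1))  # positive: items 0 and 1 are rated alike
print(adjusted_cosine(R, 0, 2))  # negative: items 0 and 2 are rated oppositely
```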
Euclidean
- Euclidean distance
- Typically normalized for use as a similarity measure
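One common normalization is 1 / (1 + d); the note does not specify which normalization it has in mind, so this choice is an assumption:

```python
from math import sqrt

def euclidean_similarity(u, v):
    """Euclidean distance turned into a similarity in (0, 1] via 1 / (1 + d)."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)

print(euclidean_similarity([1, 2], [1, 2]))  # 1.0: identical vectors
print(euclidean_similarity([0, 0], [3, 4]))  # distance 5 -> similarity 1/6
```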
Jaccard Similarity
- Ignores the rating values themselves; uses only which items were rated
- The ratio of the number of co-rated items to the number of items rated by either user (intersection over union)
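A sketch over the sets of rated item indices (again assuming 0 means "unrated"):

```python
def jaccard(u, v):
    """Jaccard similarity over the *sets* of rated items (0 = unrated)."""
    I_u = {i for i, r in enumerate(u) if r}
    I_v = {i for i, r in enumerate(v) if r}
    if not (I_u | I_v):
        return 0.0
    return len(I_u & I_v) / len(I_u | I_v)

print(jaccard([2, 0, 3, 0], [5, 2, 0, 2]))  # 1 co-rated item of 4 rated overall -> 0.25
```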
Mean Square Distance
- Mean square distance (MSD) is the sum of squared rating differences over co-rated items divided by the number of co-rated items
- Mean square similarity is calculated by subtracting MSD from 1 (with ratings normalized so that MSD lies in [0, 1])
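A sketch of MSD-based similarity; scaling ratings by an assumed maximum rating `r_max` keeps the result in [0, 1]:

```python
def msd_similarity(u, v, r_max=5.0):
    """1 - MSD over co-rated items, with ratings scaled to [0, 1] by r_max
    (the scale and the 0 = 'unrated' convention are assumptions)."""
    pairs = [(a / r_max, b / r_max) for a, b in zip(u, v) if a and b]
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 - msd

print(msd_similarity([4, 4, 0], [4, 3, 2]))  # close ratings -> approx. 0.98
```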
Newer Metrics
JMSD
- Combines numerical information from MSD and non-numerical information from Jaccard
- Computes partial similarities from Jaccard and MSD, and the final similarity is generated by multiplying these two similarity measures
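Combining the two pieces, a self-contained JMSD sketch (0 = "unrated" and the max-rating scale are assumptions):

```python
def jmsd(u, v, r_max=5.0):
    """JMSD = Jaccard x MSD-similarity, both over rated/co-rated items."""
    I_u = {i for i, r in enumerate(u) if r}
    I_v = {i for i, r in enumerate(v) if r}
    union, inter = I_u | I_v, I_u & I_v
    if not union or not inter:
        return 0.0
    jac = len(inter) / len(union)
    msd = sum(((u[i] - v[i]) / r_max) ** 2 for i in inter) / len(inter)
    return jac * (1.0 - msd)

# Jaccard = 1/4, MSD-similarity = 1 - 0.36 = 0.64, product = 0.16
print(round(jmsd([2, 0, 3, 0], [5, 2, 0, 2]), 4))  # 0.16
```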
SING
- A similarity metric specialized for memory-based collaborative filtering
- The paper is not publicly available (only the Abstract and References are available)
GEN
- A similarity metric based on genetic algorithms
- Aimed at improving the accuracy and results of CF-based recommendation systems
- Example
- Assume rating vectors r_1 = (4, 5, x, 3, 2, x, 1, 1, 4) and r_2 = (4, 3, 1, 2, x, 3, 4, x, 2), where x means "not rated"
- The absolute rating differences over co-rated items are r_d = (0, 2, x, 1, x, x, 3, x, 2)
- Count the proportions of rating differences: v = (1/5, 1/5, 2/5, 1/5, 0) (proportion of difference 0, proportion of difference 1, proportion of difference 2, ...)
- Each proportion is weighted by a vector w, e.g. w = (1, 0.5, 0, -0.5, -1)
- A weight of 1 means that rating difference signals "very similar"; -1 means "very dissimilar"
- A weight of 0 means that rating difference is judged unrelated to similarity
- w is learned with a genetic algorithm to minimize the overall system MAE
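The worked example can be finished numerically. The note does not state how v and w are combined, so treating the final similarity as the weighted sum v · w is an illustrative assumption:

```python
# Proportions of absolute rating differences 0..4 from the example above
v = [1/5, 1/5, 2/5, 1/5, 0]
# Example weight vector; in GEN these weights are learned by a genetic algorithm
w = [1, 0.5, 0, -0.5, -1]

# Assuming the final similarity is the weighted sum v . w
sim = sum(vi * wi for vi, wi in zip(v, w))
print(round(sim, 10))  # 0.2
```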
TRUST
- Derives user trustworthiness from rating data
- The paper is not publicly available (only the Abstract and References are available)
Problems with Classic Metrics
- For example, consider rating vectors U = (2, 0, 3, 0) and V = (5, 2, 0, 2), where 0 means "not rated"
- The two users co-rated only 1 item
- Pearson similarity is undefined on a single co-rated item (no variance), so it is conventionally set to 0
- Cosine similarity over that single co-rated item is 1, despite almost no shared evidence
- U=(2, 1, 3, 2), V=(1, 2, 2, 3) appear similar, but Pearson similarity is 0
- U=(2, 2, 0, 1), V=(4, 4, 0, 2) yields cosine similarity of 1
- When two users share many rated items, cosine similarity tends to come out high
- It is pulled up by uninformative low ratings such as 1, which still point the vectors in the same direction
- U=(5, 5, 1, 1), V=(1, 1, 5, 5) yields very high Jaccard, but the preferences are clearly different
- Ignoring rating values and focusing only on whether co-rating exists discards a large amount of information
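The failure cases above can be checked numerically. In this sketch, cosine and Pearson are computed over co-rated items only (0 = "unrated"), which is the convention the examples assume:

```python
from math import sqrt

def corated(u, v):
    return [(a, b) for a, b in zip(u, v) if a and b]

def cosine(u, v):
    pairs = corated(u, v)
    num = sum(a * b for a, b in pairs)
    den = (sqrt(sum(a * a for a, _ in pairs))
           * sqrt(sum(b * b for _, b in pairs)))
    return num / den if den else 0.0

def pearson(u, v):
    pairs = corated(u, v)
    n = len(pairs)
    if n < 2:
        return 0.0  # undefined: conventionally 0
    mu = sum(a for a, _ in pairs) / n
    mv = sum(b for _, b in pairs) / n
    num = sum((a - mu) * (b - mv) for a, b in pairs)
    den = (sqrt(sum((a - mu) ** 2 for a, _ in pairs))
           * sqrt(sum((b - mv) ** 2 for _, b in pairs)))
    return num / den if den else 0.0

def jaccard(u, v):
    I_u = {i for i, r in enumerate(u) if r}
    I_v = {i for i, r in enumerate(v) if r}
    return len(I_u & I_v) / len(I_u | I_v) if I_u | I_v else 0.0

print(cosine([2, 0, 3, 0], [5, 2, 0, 2]))   # 1.0: one shared item, still "identical"
print(pearson([2, 1, 3, 2], [1, 2, 2, 3]))  # 0.0 despite similar-looking vectors
print(cosine([2, 2, 0, 1], [4, 4, 0, 2]))   # 1.0: V is just 2*U
print(jaccard([5, 5, 1, 1], [1, 1, 5, 5]))  # 1.0: opposite tastes, same rated set
```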