
Recommendation Models

A summary of recent trends in recommendation models, also known as recommender systems.

  • Various descriptions exist, and the scope of terminology may differ depending on the author

  • \langle w, x \rangle denotes the inner product of w and x

Classification

  • Recommendation systems are broadly divided into 3 types:
    • Rule-based: Does not use other users' information at inference time
    • Memory-based: Uses other users' information at both training and inference time
    • Model-based: Uses other users' information at training time but not at inference time
|  | Rule-based | Memory-based | Model-based |
| --- | --- | --- | --- |
| Degree of personalization | Low | High | Depends on the model |
| Method | Recommend according to rules | Compute from overall item/user information for each recommendation | Input the target user's information into the model for each recommendation |
| Techniques | - | Collaborative filtering | Cluster models, function-based, Bayesian |
  • Other classifications:
    • No personalization: Same recommendation for all users
    • Temporary personalization: Recommendations based only on session data (temporary data)
    • Persistent personalization: Recommendations based on long-term data (history, etc.)
    • Non-updating personalization: Inference using basic information registered by the user that is generally not updated, or data valid only for the current session
|  | No personalization | Temporary personalization | Persistent personalization | Non-updating personalization |
| --- | --- | --- | --- | --- |
| Item | Spec ranking (by price) | Similar/complementary items in the cart | Similar trending items from history | Non-interactive chatbot / salesperson suggestions |
| People | Popularity ranking | Frequently bought together | Collaborative filtering (similar trending items from ratings/behavior) | Surveys, etc. |
| Knowledge | Store/season-specific recommendations | Checkout recommendations | Rare / expert recommendations for individuals | Chatbot / salesperson suggestions |

Collaborative Filtering

  • A technique that uses user rating (behavior) information to recommend items based on information from other users similar to the target user

Advantages

  • No domain knowledge required
  • Can recommend across different genres (serendipity)

Disadvantages

  • Cold start problem

    • Cannot recommend for new users or items with few purchases (actions) at the start of the service
  • Low coverage

    • Items recommended tend to be those that are frequently purchased
    • Items that have not been purchased cannot be recommended
  • Cannot recommend similar items

    • Item similarity is unknown

Procedure

  1. Accumulate user rating (behavior) information
    • If not accumulated, use rule-based or content-based systems
  2. Find users with similar preferences from rating (behavior) information
    • See the similarity metrics survey for how to calculate similarity
    • When there are many users, pre-clustering users by behavior history can save computational resources
  3. Recommend items that similar users like and that are predicted to have high ratings for the target user
    • k-NN and similar methods are used here
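The three steps above can be sketched as a similarity-weighted k-NN prediction. This is a minimal illustration, not a production recipe: the rating matrix `R`, cosine similarity, and `k` are assumptions, and 0 stands for "unrated".

```python
import numpy as np

def predict(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated it."""
    # step 2: candidates = other users who rated the item
    rated = R[:, item] > 0
    others = np.where(rated & (np.arange(len(R)) != user))[0]
    # cosine similarity between the target user and each candidate
    sims = np.array([R[user] @ R[o]
                     / (np.linalg.norm(R[user]) * np.linalg.norm(R[o]))
                     for o in others])
    # step 3: similarity-weighted average over the k nearest neighbors
    top = others[np.argsort(-sims)][:k]
    w = np.sort(sims)[::-1][:k]
    return R[top, item] @ w / w.sum()
```

Items whose predicted rating is highest among the user's unrated items would then be recommended.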

Problems

Data Sparsity

  • When there are very many items, users do not purchase most products
    • Therefore, even if users are actually similar, they cannot be identified as similar unless they have both rated the exact same items (transitive similarity problem)
  • Solutions include the following:
    • Dimensionality reduction via SVD
    • Dimensionality reduction via matrix factorization (see the Model / Function-based / Regression section)
    • Complementation via hybrid collaborative filtering
      • Calculate similarity based on items and users without using user history
    • Data complementation through predictive model construction
      • Predict and fill in evaluation values for unrated items

Scalability

  • When the number of items or users grows very rapidly, computational cost increases
  • Solutions include the following:
    • Before using CF, cluster similar items/users using item-based/user-based approaches to narrow down computation targets (similar to hybrid collaborative filtering)

Gray Sheep

  • Users whose preferences match multiple patterns or none at all, and therefore do not benefit from filtering
  • Solutions include the following:
    • Differentiation through behavior patterns other than collaborative filtering via hybrid CF

Shilling Attack

  • Refers to attacks on the rating system (such as fake reviews)
  • Solutions include the following:
    • Use item-based collaborative filtering
      • Less susceptible than user-based approaches
    • Use hybrid collaborative filtering that combines item-based CF with other approaches

Classification

  • Can be classified as memory-based, model-based, or hybrid

Memory-based CF

  • Makes recommendations based on ratings from similar users, using correlations between users and items

Representative Techniques

  • User-based
    • Analyzes similarity between users based on user behavior history and recommends items with high ratings
  • Item-based
    • Analyzes similarity between items based on user behavior history and recommends highly related items

Advantages

  • Easy to implement
  • Easy to add data
  • No need to consider content
  • No scaling of co-rated items required

Disadvantages

  • Dependence on human ratings

    • Accuracy improves by standardizing ratings
  • Performance degradation with sparse data

  • Cold start

  • Scalability

    • Can be somewhat improved by pre-clustering users and items

    • However, updates at a certain frequency are necessary

Model-based CF

  • Models user purchasing behavior using statistical algorithms and makes recommendations based on the model

Representative Techniques

  • Bayesian network CF
  • Clustering CF
  • Latent semantic CF
  • Factor analysis
  • CF with dimensionality reduction

Advantages

  • Better performance on sparse data
  • Improved prediction performance
  • Intuitive explanations

Disadvantages

  • Complex model construction
  • Scalability (faster than memory-based)
    • Can update nightly or weekly/monthly
  • Adaptability
    • Weak to changes in users and items
      • Cannot adapt without model updates
  • Loss of useful information due to dimensionality reduction

Hybrid CF

  • Uses both (non-CF) content-based and CF approaches for recommendations

Representative Techniques

  • Content-based CF
  • Memory-based + model-based CF

Advantages

  • Can overcome individual problems
  • Improved prediction performance
  • Addresses sparsity and gray sheep issues

Disadvantages

  • Complex construction
  • May require external data (information other than user behavior information) in some cases

Data Used

Explicit Data

  • Users explicitly provide ratings
  • Good/Bad, n-level ratings
  • Small data volume but accurate, with clear distinction between unrated and disliked

Implicit Data

  • Automatically collected information such as browsing data and purchase data
  • Large data volume but inaccurate, with unclear distinction between unrated and disliked

Techniques

SVD

Definition

  • For any m x n real matrix A (rank = k), there exist an m x m orthogonal matrix U, an n x n orthogonal matrix V, and an m x n matrix S whose off-diagonal elements are 0 and whose diagonal elements are positive for the first k columns and 0 thereafter, such that A = U S V^T

    • In the compact form below, the dropped components (for U, the columns from r+1 to m) multiply only zero singular values, so nothing is lost
  • The singular value decomposition of A is expressed as:

\begin{aligned} A &= U \Gamma V^T \\ A &: m \times n,\ U: m \times r,\ \Gamma: r \times r,\ V^T: r \times n \end{aligned}
  • U, V are orthogonal matrices, \Gamma is a diagonal matrix, and the diagonal elements of \Gamma are the singular values of A (> 0)
  • Dimensionality is reduced by truncating \Gamma entries whose diagonal elements are below a certain threshold

Properties

  • The following matrix A_j is the rank-j least squares approximation of A
A_j = \sum^j_{i=1}\lambda_i u_i v_i^T
  • \Gamma is uniquely determined for A, but U and V are not necessarily uniquely determined
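As a concrete sketch of truncation with NumPy (the rating matrix `R` and the kept rank `k` are assumptions for illustration):

```python
import numpy as np

# hypothetical user x item rating matrix (rows = users)
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])

# s holds the singular values in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                        # keep the two largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k least squares approximation
```

`R_k` smooths the original ratings, which is what makes the reduced representation useful for finding similar users despite sparsity.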

SVD++

  • Positioned as an evolution of MF for dimensionality reduction; can be viewed as SVD with regularization
  • See Memory-based / Factorization Machines / SVD++

Comparison with Other Dimensionality Reduction Techniques

  • NMF
    • A method with the constraint that coefficients must be non-negative
    • Not suitable because post-decomposition coefficients become sparse
  • PCA
    • Can take values below 0
    • However, in practice, browsing, ratings, and behavior are non-negative and never go below 0
  • LDA (linear discriminant analysis)
    • The assumption is not met because the data does not follow a multivariate normal distribution
  • ICA
    • The assumption is not met because ratings/behavior history are not statistically independent of each other

Memory-based

Collaborative Filtering to Recommendation Flow

  1. Calculate similarity scores between users/items
    1. Define what constitutes similarity
    2. Prepare data according to the definition
    3. Select and implement a calculation method according to the definition
    4. Actually calculate similarity scores
  2. Make recommendations weighted by similarity scores
    1. Weight by similarity scores (increase the influence of similar users/items)
    2. Normalize
    3. Sort appropriately and display results in order of highest score

Examples

User-based

|  | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| User A | 1 | 0 | x | 1 | 0 | x | 1 | 0 | 1.0 |
| User B | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0.5 |
| User C | 0 | 0 | 1 | 0 | 1 | 0 | 0 | x | 0.25 |
| User D | 0 | 1 | 0 | x | 1 | 0 | 1 | 0 | 0.0 |
| Target user | 1 | 0 | x | 1 | 0 | x | x | x | - |
| Score (n=1) | - | - | x | - | - | x | 1 | 0 | ref. A |
| Score (n=2) | - | - | 1 | - | - | 0 | 0.5 | 0.5 | ref. A & B |
  • Similarity is simplified using match rate (proportion of matching values)

  • Evaluation is simplified to 1 (purchased or not, liked or not)

  • k-NN (n=1): Recommend Item 7, which is highly rated by the most similar user (User A)

  • k-NN (n=2): Recommend Item 3, which has the highest average rating among the similar Users A and B
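The similarity and score rows of the table can be reproduced directly. This is a toy sketch of the simplified match-rate scheme, with `None` playing the role of the table's x:

```python
x = None  # x = unrated
ratings = {
    "A": [1, 0, x, 1, 0, x, 1, 0],
    "B": [0, 0, 1, 1, 1, 0, 0, 1],
    "C": [0, 0, 1, 0, 1, 0, 0, x],
    "D": [0, 1, 0, x, 1, 0, 1, 0],
}
target = [1, 0, x, 1, 0, x, x, x]

def match_rate(u, v):
    """Proportion of matching values over co-rated items."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

def scores(target, ratings, n):
    """Average the n most similar users' ratings for the target's unrated items."""
    top = sorted(ratings, key=lambda u: match_rate(ratings[u], target),
                 reverse=True)[:n]
    out = {}
    for i, r in enumerate(target):
        if r is None:
            vals = [ratings[u][i] for u in top if ratings[u][i] is not None]
            if vals:
                out[i] = sum(vals) / len(vals)
    return out

sims = {u: match_rate(ratings[u], target) for u in ratings}
# sims reproduces the Similarity column: A 1.0, B 0.5, C 0.25, D 0.0
```

`scores(target, ratings, 1)` reproduces the n=1 row (Item 7 scores 1) and `scores(target, ratings, 2)` the n=2 row (Item 3 scores 1).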

Common Improvements

  • There are many improvements in computing similarity between users and calculating which items to recommend from similar users
    • Include whether items were viewed in the similarity calculation
    • Weight by similarity when calculating items to recommend
    • Weight by the number of raters, etc.
  • For sparse data, apply dimensionality reduction
    • When using PCA, fill missing values with default voting values (see below) or the EM algorithm
  • To reduce dependence on rating values, convert to rankings and use ranking-based similarity metrics

User-based Specific Improvements

  • When there are few common raters per user, fill unrated items with the overall rating for that item from all users (default voting value)
  • Give more weight to extreme ratings (minimum and maximum)

Item-based

|  | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| User A | 1 | 0 | x | 1 | 0 | x | 1 | 0 |
| User B | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| User C | 0 | 0 | 1 | 0 | 1 | 0 | 0 | x |
| User D | 0 | 1 | 0 | x | 1 | 0 | 1 | 0 |
| Target user | 1 | 0 | x | 1 | 0 | x | x | x |
| Similarity with Item 1 | - | - | 0.33 | - | - | 1.00 | 0.75 | 0.33 |
| Similarity with Item 4 | - | - | 0.5 | - | - | 0.5 | 0.67 | 0.5 |
| Similarity with highly rated items | - | - | 0.42 | - | - | 0.75 | 0.71 | 0.42 |
  • Simplified using the same calculation method as user-based

  • Improvement methods are also similar to user-based

  • Using Items 1 and 4, which the target user rated highly, calculate the similarity of unrated items (the last row is the average of the two similarity rows)

  • Recommend Item 6, which has the highest similarity
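The similarity rows can be reproduced by applying the same match rate to item columns instead of user rows. A toy sketch, with `None` playing the role of x and only the needed columns transcribed:

```python
x = None  # x = unrated
# Columns of the table above (rows A, B, C, D, target).
cols = {
    1: [1, 0, 0, 0, 1], 3: [x, 1, 1, 0, x], 4: [1, 1, 0, x, 1],
    6: [x, 0, 0, 0, x], 7: [1, 0, 0, 1, x], 8: [0, 1, x, 0, x],
}

def match_rate(u, v):
    """Proportion of matching values over users who rated both items."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

sim1 = {i: round(match_rate(cols[1], cols[i]), 2) for i in (3, 6, 7, 8)}
sim4 = {i: round(match_rate(cols[4], cols[i]), 2) for i in (3, 6, 7, 8)}
# sim1 and sim4 reproduce the "Similarity with Item 1/4" rows;
# averaging them per item gives the final row, so Item 6 wins.
```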

Item-based Specific Improvements

  • Temporary personalization can be achieved by recommending items similar to those in co-occurrence, browsing, shopping, and recent usage history

  • Advantageous over user-based in terms of privacy protection

Problems

  • In situations where recommended items change frequently (such as movies and concerts), frequent neighbor updates are required
  • Experimentally tends to be biased toward specific items (though there is no theoretical basis)

Differentiating User-based and Item-based

  • User-based
    • Fits in memory with frequent changes
    • Each user's preferences have unique values
      • Sites that recommend specific things
    • High serendipity (discovering unexpected items)
  • Item-based
    • Targets large data
    • When there are many items, it is difficult to compute similarity between users
      • Few people buy the same items, or the matrix becomes sparse
    • Item-to-item similarity can be pre-computed, making production recommendations easier
    • Item-to-item similarity changes less frequently, requiring fewer recalculations
    • Easy to explain

Factorization Machines

  • A general-purpose model presented at ICDM in 2010

  • Efficiently computes feature interactions

    • When there are features representing age and gender, the AND condition feature of age + gender is represented by the interaction between age and gender
    • The interaction weight is learned as the inner product of vectors corresponding to each feature
    • Higher-order FMs have been proposed, but as of 2018, accuracy improvements only go up to around 3rd order
  • Somewhat different in concept from CF

| user A | user B | user C | Item a | Item b |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
  • Each rating is represented as one row, and the user-item interaction feature vector is extracted
  • Interactions can freely include time, context, etc.
    • For example, adding Age allows the model to handle User x Item x Age interactions

Prediction Formula

  • 2nd-order FM
\begin{aligned} \hat y(x) &:= w_0 + \sum^n_{i=1}w_ix_i+\sum^n_{i=1}\sum^n_{j=i+1}\langle v_i,v_j\rangle x_ix_j \\ \text{where} \ \ w_0 &\in \mathbb{R},\ \ w \in \mathbb{R}^n,\ \ V \in \mathbb{R}^{n\times k} \\ \langle v_i,v_j\rangle &:= \sum^k_{f=1}v_{i,f} v_{j,f} \end{aligned}

Row v_i in V is the i-th variable with k factors

k \in \mathbb{N}^+_0 is a hyperparameter defining the dimensionality of the factorization

  • Reference: Standard 2nd-order linear model with interactions
    • When directly applied to recommendation data, x_ix_j is sparse, making it difficult to learn w_{i,j}
    • Therefore, that part is learned as an inner product of vectors
      • This also reduces the number of learning parameters
\hat y(x) := w_0 + \sum^n_{i=1}w_ix_i+\sum^n_{i=1}\sum^n_{j=i+1}w_{i,j}x_ix_j
  • 2-way FM captures single-variable effects plus pairwise interactions between variables
  • w_0 is the global bias
  • w_i models the strength of the i-th variable
  • w_{i,j} := \langle v_i, v_j \rangle models the interaction between the i-th and j-th variables
  • Instead of using a separate parameter w_{i,j} for each interaction, the FM model factorizes the interactions
    • This is important for sparse vectors
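The pairwise term above can be evaluated in O(kn) rather than O(kn^2) using the identity \sum_{i<j}\langle v_i,v_j\rangle x_ix_j = \frac{1}{2}\sum_f[(\sum_i v_{i,f}x_i)^2 - \sum_i v_{i,f}^2x_i^2]. A sketch with illustrative names, checked against the naive double loop:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2nd-order FM prediction in O(kn):
    sum_{i<j} <v_i,v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]"""
    s = V.T @ x                        # (k,) per-factor weighted sums
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise
```

This linear-time form is what makes FM practical on the sparse one-hot rows shown in the table above.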

Prediction

  • Regression
    • Use \hat y(x) directly as the prediction and train with MSE, etc.
  • Binary classification
    • Use \hat y(x) as the score and train with logistic or hinge loss
  • Ranking
    • Rank vectors x by \hat y(x) and train with a pairwise classification loss over pairs (x^{(a)}, x^{(b)}) \in D

Optimization Methods

  • SGD, ALS, MCMC, etc.
  • Can train with various losses including MSE, logistic loss, hinge loss
  • Use L2 regularization to prevent overfitting
    • Early stopping may give better accuracy
  • The gradient of \hat y is:
\frac{\partial}{\partial\theta}\hat y(x)= \begin{cases} 1, & \text{if}\ \theta\ \text{is}\ w_0 \\ x_i, & \text{if}\ \theta\ \text{is}\ w_i \\ x_i\sum^n_{j=1}v_{j,f}x_j- v_{i,f}x^2_i, & \text{if}\ \theta\ \text{is}\ v_{i,f} \end{cases}
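These gradients can be checked numerically against the prediction. A sketch with illustrative names, using the O(kn) prediction form:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2nd-order FM prediction via the O(kn) reformulation."""
    s = V.T @ x
    return w0 + w @ x + 0.5 * float(np.sum(s ** 2 - (V ** 2).T @ (x ** 2)))

def fm_grads(x, V):
    """Gradients of y_hat w.r.t. w0, w, and V for the 2nd-order FM."""
    s = V.T @ x                                   # sum_j v_{j,f} x_j
    g_w0 = 1.0                                    # d y / d w0 = 1
    g_w = x.copy()                                # d y / d w_i = x_i
    g_V = np.outer(x, s) - V * (x ** 2)[:, None]  # x_i s_f - v_{i,f} x_i^2
    return g_w0, g_w, g_V
```

Note that `s` can be precomputed once per example and reused for every `v_{i,f}`, which is what makes SGD on FMs cheap.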

SVD++

Original MF (SVD)

\hat r_{ui} = \mu + b_{user}(u)+b_{item}(i)+U_uI_i

where:

  • \hat r_{ui}: predicted rating
  • \mu: global mean
  • b_{user}: individual (user) bias
  • b_{item}: item bias
  • U: user attributes (U \in \mathbb{R}^{u \times k})
  • I: item attributes (I \in \mathbb{R}^{k \times m})

loss = \min_{b,U,I}\sum_{u,i\in R}(r_{ui}-\hat r_{ui})^2 + \lambda(|b_{user}|^2+|b_{item}|^2+|U^T_u|^2+|I_i|^2)

SVD++

\hat{r}_{ui} = \mu + b_{user}(u)+b_{item}(i)+\left(U_u+|R(u)|^{-0.5}\sum_{j \in R(u)}y_j\right)I_i

where R(u) is the set of items rated by user u and y_j is the effect of a user action on item j.

The loss function is:

\text{loss} = \text{argmin}\left[\sum(r_{ui}-\hat{r}_{ui})^2+\lambda_1(||b||^2+||U||^2)+\lambda_2(||I||^2+||y||^2)\right]
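Training the biased-MF form above with SGD can be sketched as follows. This omits the implicit-feedback term y_j of SVD++ for brevity; the learning rate, regularization strength, and data layout are all assumptions for illustration:

```python
import numpy as np

def sgd_epoch(ratings, mu, b_u, b_i, U, I, lr=0.01, lam=0.05):
    """One SGD pass over ratings = iterable of (user, item, r);
    updates parameters in place following
    r_hat = mu + b_u[u] + b_i[i] + U[u] @ I[i]."""
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + U[u] @ I[i])
        b_u[u] += lr * (err - lam * b_u[u])
        b_i[i] += lr * (err - lam * b_i[i])
        # simultaneous update of the two factor vectors
        U[u], I[i] = (U[u] + lr * (err * I[i] - lam * U[u]),
                      I[i] + lr * (err * U[u] - lam * I[i]))
```

Each update only touches one user row and one item row, so a pass is linear in the number of observed ratings regardless of matrix size.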

Field-aware FM (FFM)

Prediction Formula

  • An extension of FM
    • FM learns one embedding vector per feature
    • FFM improves on this by splitting by field (the original categorical variable a feature came from) and learning one embedding vector per (feature, field) pair
  • FM assumed one-hot encoded categorical variables as features, so different attributes were uniformly converted to values of 1
    • The information about which original categorical variable each feature belonged to was lost
    • FFM can express field-specific interaction factors
\hat{y}(x) := w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle v_{if_j}, v_{jf_i} \rangle x_i x_j

where f_i and f_j denote the fields that features i and j belong to, respectively.
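A direct sketch of the formula, assuming `V` stores one k-dimensional embedding per (feature, field) pair; all names here are illustrative:

```python
import numpy as np

def ffm_predict(x, fields, w0, w, V):
    """FFM prediction. V[i, f] is the embedding feature i uses when
    interacting with a feature from field f; V is assumed to have
    shape (n_features, n_fields, k), and fields[i] gives i's field."""
    y = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            # each side picks the embedding matched to the other's field
            y += (V[i, fields[j]] @ V[j, fields[i]]) * x[i] * x[j]
    return y
```

With a single field this collapses to plain FM, which gives a convenient sanity check.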

Training Method

  • Same as FM
  • However, due to the increased number of parameters, it is more prone to overfitting, so early stopping is used

Deep FM Variants

Factorization-machine supported Neural Network (FNN)

  • ECIR2016

  • Standard MLP

    • Categorical input data is one-hot encoded per variable, then embedded and fed into a dense network
    • FM weights and vectors are used to initialize layer weights

Deep Crossing Network (DCN)

  • KDD2016
  • Replaces the MLP part of FNN with ResNet-like skip connections to predict residuals
  • Since experiments were on internal data, it is not widely cited, but later research shows higher accuracy than FNN

Wide & Deep (arXiv:1606.07792)

  • 2016
  • Uses embedded categorical values, same as FNN and DCN
  • Attaches a linear model at the output layer
  • Better accuracy than FNN and DCN, plus a TensorFlow API exists

Product-based Neural Network (PNN)

  • ICDM2017

  • Embedding categorical variables makes it difficult to handle explicit interactions between categorical variables

  • Incorporates categorical variable interaction terms into the model architecture

  • Creates a layer for interactions between categorical variables

  • PNN is more accurate than FNN with stacked embeddings

    • (Possibly weaker than Wide & Deep)

DeepFM

  • IJCAI2017

  • Evolution of Wide & Deep?

  • Feed sparse features into an embedding layer

    • Branch 1: Feed into Factorization Machine
    • Branch 2: Feed into a dense network
  • Combine Branch 1 and 2 for output

  • More accurate than FM & Deep (version without embedding the FM input)

Deep & Cross Network (CrossNet)

  • ADKDD2017
  1. Use continuous values as-is, embed categorical variables as features
  2. Branch as follows:
    1. Branch 1: Feed into a Cross Network that considers interaction terms
    2. Branch 2: Feed into a dense network
  3. Combine Branches 1 and 2, apply FC, then output with Sigmoid

xDeepFM

  • KDD2018
  • Learns higher-order features explicitly at the vector level using an architecture called Compressed Interaction Network (CIN)
  1. Branch:
    1. Linear layer
    2. Embedding
  2. Branch from 1-2 Embedding:
    1. CIN
    2. Plain DNN
  3. Combine 1-1, 2-1, 2-2 for output

FFM Extensions

Field-aware Probabilistic Embedding Neural Network (FPENN)

  • RecSys2018
  • Prevents overfitting + introduces NN
  • Same approach of learning Field-number of embeddings
    • However, estimates the parameters of the probability density that the embedding follows
    • This enables robust learning against noise in the training data
  • Combination with Deep follows DeepFM (?)

Field-weighted Factorization Machine (FwFM)

  • WWW2018
  • The importance for prediction varies for each pair of fields
    • Therefore, the importance of each field pair is also formalized

Interaction-aware Factorization Machines (IFM)

  • AAAI 2019
  • Attention + FM
  • Considers attention at both feature level and field level
  • Since attention is not well understood, this model and the one below are not fully grasped

Field Attention Deep Field-aware Factorization Machine (FAT-DeepFFM)

  • ICDM 2019

  • Attention + FFM

  • Considers attention at the embedding stage

Model-based

Cluster Model

Method

  1. Create clusters of users with similar preference patterns
    • Uses item rating values, etc.
  2. Find the most similar cluster for the user
  3. Recommend items in order of highest average rating within the cluster
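A minimal sketch of these steps with a hand-rolled k-means (all data and names are illustrative; 0 means unrated, and initialization simply uses the first k rows for determinism):

```python
import numpy as np

def kmeans(R, k, iters=20):
    """Step 1: cluster users by their rating vectors."""
    centers = R[:k].astype(float).copy()   # simplistic deterministic init
    labels = np.zeros(len(R), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((R[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = R[labels == c].mean(axis=0)
    return labels

def recommend(R, user, k=2):
    labels = kmeans(R, k)
    cluster = labels == labels[user]         # step 2: the user's own cluster
    mean_ratings = R[cluster].mean(axis=0)   # step 3: average rating per item
    scores = np.where(R[user] == 0, mean_ratings, -np.inf)  # unrated only
    return int(np.argmax(scores))
```

The choice of `k` here directly controls the trade-off described under Properties below: fewer clusters give broader, less personalized recommendations.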

Properties

  • The nature of recommendations varies greatly depending on the number of clusters
    • The number of clusters needs to be adjusted according to the objective
  • The types of recommendations are limited by the number of clusters
    • With few clusters, recommendations are broad and not very personalized
      • On the other hand, more robust against the cold start problem
    • With many clusters, it is difficult to find stable clusters
      • On the other hand, the degree of personalization increases
    • Cannot address the gray sheep problem
      • Since users are assigned to specific clusters, recommendations spanning multiple clusters are not possible
  • Easy to implement and fast execution speed

Function-based

  • General models using utility functions that take larger values the more a user likes something, or functions that predict rating values
  • Reduces to regression, classification, ranking regression, etc.

Regression

About the Model

  • Decompose the rating matrix R (without missing values) as follows:

  • U is k x n dimensional, V is k x m dimensional (k \ll m, n)

R^* \approx U^T V
  • Column x of U represents features of user x, column y of V represents features of item y
  • Therefore, user x's rating for item y is:
r^*_{xy} \approx u^T_x v_y
  • This can be viewed as a similar model to user-based collaborative filtering where u_x = 1/|v|, v_y = item ratings

Training Method

  • Learn U and V by decomposition to minimize the expected loss between R and R*
  • Use matrix factorization techniques
    • Assume the residual R-R* follows a normal distribution, and compute U and V to maximize the generation probability of observed ratings (factor analysis)
  • Minimize the objective function
    • lambda is a parameter to prevent overfitting
    • Train with SGD or ALS (fix other values and update parameters sequentially)

\text{loss} = \min_{p,q}\sum\bigl(r_{ui}-q^T_ip_u\bigr)^2+\lambda \left(||q_i||^2+||p_u||^2\right)

  • For missing values, the following approaches are possible:
    • Treat as latent variables and solve using the EM algorithm
    • Compute only on data without missing values
    • Fill with mean values, etc.
    • Ignore missing values when computing the loss

Others

  • Simple enough to extend with privacy-preserving filtering and nonlinear regression for prediction accuracy improvement

  • Representing the entire system with a single linear model may be too coarse, resulting in low prediction accuracy

    • Can be addressed with the Horting method, which constructs local linear models among users with similar preferences

Classification

  • Ignore the magnitude relationship of rating values and treat them as classes
    • Modifications to consider magnitude relationships are also possible
  • Features can include ratings from other users for the target item and ratings from the target user for other items

Sequential Binary Relation Learning

  • Acquire a function representing the ease of classification for each class
  • Classify the target into the class where the function value is maximum

\hat{r}_{xy} = \operatorname{argmax}_{r \in R}\left( \sum_{i} u_{xi}+\sum_{j} w_{yj} \right)

where:

  • u_{xi} is the user-similarity weight for user i evaluating item y with rating r

  • w_{yj} is the item-similarity weight for item j evaluated by user x with rating r

  • u, w are obtained within an online learning framework

    • SGD with non-converging learning rate (Momentum), etc.

    1. First initialize u, w
    2. Each time r is observed, for all users i other than user x who have rated item y:
      • Increase the weight u if the rating is the same, decrease it otherwise
    3. Also, for all items j other than y that user x has rated:
      • Increase the weight w if the rating is the same, decrease it otherwise

Others

  • Being online learning, it can handle addition of items and users despite being model-based

  • Cannot handle deletion of items or users

Ranking

  • There is research on recommending with RankBoost

k-NN

  • Uses a model but is not model-based?

  • Used in the context of CF

  • Makes recommendations from the ratings of K nearest items/users

    • May also take a weighted sum favoring closer users

Rule-based

  • Could be called a model, but refers to expert models?

Bayesian Classifier (Probabilistic Model)

History-conditioned Type

  • The conditional expected value of the target user's rating for item y, given the target user's past preference data
    • The probability distribution is unknown, so an estimated function is used
    • Other sample users' preference data is not referenced in the formula, but it is used in the estimation of the probability distribution, so this qualifies as collaborative filtering-based recommendation

\hat{r}_{ay} = E[r_{ay} \mid r_{aj}, j \in Y_a] = \sum_{r \in R} r\,P(r_{ay}=r \mid r_{aj}, j \in Y_a)

where:

  • r_{ay} is the rating given by user a to item y

  • Y_a is the set of items rated by user a

  • For preference data with R = 1, this becomes a conditional probability distribution

    • Training data is required, but with implicit ratings, noise where r_ay=0 becomes large, which is disadvantageous
    • Also, this form requires a model for each individual user, which is impractical
\hat r_{ay} = \Pr \bigl[ r_{ay}=1 \mid s_{aj} ~\text{s.t.}~ j \in Y_a \bigr]
  • In practice, the joint distribution of ratings for all items, independent of individual users, is computed

P(r_{\cdot y} \mid y \in Y)

where r_{\cdot y} is the rating distribution of all users for item y

  • The conditional distribution in the first formula is computed using Bayes' rule and marginalization to eliminate unnecessary variables
    • However, with |Y| = m, the number of parameter combinations |R_1| \times |R_2| \times \dots \times |R_m| grows exponentially with m
    • Therefore, partial conditional independence between variables r.y is assumed to reduce the parameters to learn
    • Probabilistic models with conditional independence are called graphical models because they represent dependencies as graphs
      • Bayesian networks are one such model

Bayesian Networks

  • Bayesian networks are pre-built from past user behavior data

  • In production, items to recommend are determined when events occur

    • Feed viewed pages to the Bayesian network for inference, computing viewing probabilities
  • Some research uses them to create user segments

    • In this case, the probability of belonging to each segment is obtained
    • Clustering can also provide distances from each cluster center
    • Bayesian networks do not require data for all nodes
  • Maximum Weighted Spanning Tree (MWST) Algorithm

    • Suitable for large-scale data with good prediction performance, computational complexity is O(n^2) with respect to the number of nodes
    • Constructs a spanning tree with the highest conditional probability
    • Covered separately below
  • Pearl's Message Passing Algorithm

    • A method for quickly computing marginal posterior probabilities of all nodes in the network
  • User-item Bayesian

    • Compares metadata of items rated by the user with metadata of other items
    • Computes the probability of selecting other items
  • User-user Bayesian

    • Compares metadata of items liked by users
    • Computes similarity of preferences between users
  • Item-item Bayesian

    • Compares item metadata
    • Computes similarity between items

Naive Bayes

Co-occurrence Type

  • The fact that a user rated an item is viewed as the co-occurrence of the event "the user is x" and the event "the item is y"
  • This co-occurrence probability is modeled
  • Uses Probabilistic Latent Semantic Analysis (pLSA)
    • Very flexible, allowing extensions to incorporate factors such as rating variation, item information, context information, and personal attribute information
  • Extending to the Bayesian framework to handle new users yields Latent Dirichlet Allocation (LDA)
  • Details at http://www.kamishima.net/archive/recsysdoc.pdf P70

MWST Algorithm

  • A method for constructing a tree based on mutual information between variables
  • Consists of the following steps:
  1. From the given data, compute the mutual information I(x_i; x_j) for all N(N-1)/2 edges
  2. Extract the edge with the highest mutual information and make it part of the tree
  3. Add the edge with the next highest mutual information to the tree
    • However, discard the edge if it creates a loop
  4. Repeat step 3 until N-1 edges are added to the tree
  5. Finally, determine the root node and direct edges from the root toward the leaves
    • In a paper recommending web pages (100,000+ pages), the root node was set as the page viewed by the most users
    • Building a Bayesian network for all site web pages was computationally infeasible, with a limit of about 5,000 pages
    • Site pages were divided into several dozen categories, Bayesian networks were built with category variables, and category inference was performed for users
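Steps 1-4 above can be sketched as a Kruskal-style greedy over mutual information. This is an illustration only; the plug-in MI estimator for binary variables and all names are assumptions:

```python
import itertools
from math import log

def mutual_info(data, i, j):
    """Plug-in estimate of I(x_i; x_j) for binary variables,
    data = list of 0/1 observation rows."""
    n = len(data)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for row in data if row[i] == a and row[j] == b) / n
            p_a = sum(1 for row in data if row[i] == a) / n
            p_b = sum(1 for row in data if row[j] == b) / n
            if p_ab > 0:
                mi += p_ab * log(p_ab / (p_a * p_b))
    return mi

def mwst(data, n_vars):
    # step 1: MI for all N(N-1)/2 edges, sorted from highest to lowest
    edges = sorted(((mutual_info(data, i, j), i, j)
                    for i, j in itertools.combinations(range(n_vars), 2)),
                   reverse=True)
    parent = list(range(n_vars))        # union-find to detect loops
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for _, i, j in edges:               # steps 2-4: greedy edge insertion
        ri, rj = find(i), find(j)
        if ri != rj:                    # discard edges that create a loop
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == n_vars - 1:
            break
    return tree
```

Step 5 (rooting and directing the tree) is then a single traversal from the chosen root.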

Advantages

  • Only uses up to second-order statistics, so computation from data is easy and reliable
  • Computational complexity is at most O(n^2) with respect to the number of nodes n
  • Not statistically consistent, but prediction efficiency is good

Pearl's Message Passing Algorithm

  • A method for fast probabilistic inference
  • Generally, in Bayesian networks, after constructing the network or when data is given to nodes, marginalization is needed to compute the posterior probability of a given node
  • This algorithm efficiently computes this marginalization using the network structure
  1. Send messages from the node that received data to its neighboring nodes
  2. Nodes that receive a message update their marginal posterior probabilities using the received message
  3. Updated nodes send messages to their neighboring nodes (except the sender)
  4. Repeat 2 and 3 to update all nodes' marginal posterior probabilities

At Inference Time

  • Feed the pages viewed by the user as data into the Bayesian network for inference, computing item preference probabilities for that user
    • Use Pearl's Message Passing algorithm for speed
    • Arrange items in descending order of computed preference probability

Time Series Models

Maximum Entropy Model

P(y^{(t+1)} \mid Y^{(t)}) = \frac{1}{Z}\exp\left[\sum_{k=1}^K \lambda_k f_k(y^{(t+1)}, Y^{(t)})\right]

where:

  • Z is the normalization constant

  • \lambda_k are weight parameters

  • f_k are feature functions

  • A feature function returns 1 when the purchase history Y^{(t)} has a certain characteristic and the next purchased item is a specific one, and 0 otherwise

  • The probability of purchasing item y' next given that item y is in the purchase history:

P(y^{(t+1)}=y' \mid y \in Y^{(t)}) - P(y^{(t+1)}=y')

We select items where this difference is largest.

  • By selecting feature functions this way, the model parameters \lambda_k can be estimated based on the maximum entropy principle
  • Various feature functions can be used

Markov Process Model

  • Predicts using the most recent K items in the sequence
  • Uses reinforcement learning, etc.

Subscription/Repeat Purchase

  • Uses techniques such as survival analysis
  • Cox proportional hazards model

Content-based Filtering

Data Format

  • There are preference data and search queries
  • For preference data, items are predicted using user attributes and context attributes as features with machine learning algorithms
