
Recommendation Models

A summary of recent trends in recommendation models, also known as recommender systems.

  • Various descriptions exist, and the scope of terminology may differ depending on the author

  • \langle w, x \rangle denotes the inner product of w and x

Classification

  • Recommendation systems are broadly divided into 3 types:
    • Rule-based: Does not use other users' information at inference time
    • Memory-based: Uses other users' information at both training and inference time
    • Model-based: Uses other users' information at training time but not at inference time
|  | Rule-based | Memory-based | Model-based |
| --- | --- | --- | --- |
| Degree of personalization | Low | High | Depends on the model |
| Method | Recommend according to rules | Compute from overall item/user information for each recommendation | Input the target user's information into the model for each recommendation |
| Techniques | - | Collaborative filtering | Cluster models, function-based, Bayesian |
  • Other classifications:
    • No personalization: Same recommendation for all users
    • Temporary personalization: Recommendations based only on session data (temporary data)
    • Persistent personalization: Recommendations based on long-term data (history, etc.)
    • Non-updating personalization: Inference using basic information registered by the user that is generally not updated, or data valid only for the current session
|  | No personalization | Temporary personalization | Persistent personalization | Non-updating personalization |
| --- | --- | --- | --- | --- |
| Item | Spec ranking (by price) | Similar/complementary items in the cart | Similar trending items from history | Non-interactive chatbot / salesperson suggestions |
| People | Popularity ranking | Frequently bought together | Collaborative filtering (similar trending items from ratings/behavior) | Surveys, etc. |
| Knowledge | Store/season-specific recommendations | Checkout recommendations | Rare / expert recommendations for individuals | Chatbot / salesperson suggestions |

Collaborative Filtering

  • A technique that uses user rating (behavior) information to recommend items based on information from other users similar to the target user

Advantages

  • No domain knowledge required
  • Can recommend across different genres (serendipity)

Disadvantages

  • Cold start problem

    • Cannot recommend for new users or items with few purchases (actions) at the start of the service
  • Low coverage

    • Items recommended tend to be those that are frequently purchased
    • Items that have not been purchased cannot be recommended
  • Cannot recommend similar items

    • Item similarity is unknown

Procedure

  1. Accumulate user rating (behavior) information
    • If not accumulated, use rule-based or content-based systems
  2. Find users with similar preferences from rating (behavior) information
    • See the similarity metrics survey for how to calculate similarity
    • When there are many users, pre-clustering users by behavior history can save computational resources
  3. Recommend items that similar users like and that are predicted to have high ratings for the target user
    • k-NN and similar methods are used here
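The three steps above can be sketched as a similarity-weighted k-NN prediction. This is a minimal illustration, not a production recipe: the rating matrix `R`, cosine similarity, and `k` are assumptions, and 0 stands for "unrated".

```python
import numpy as np

def predict(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated it."""
    # step 2: candidates = other users who rated the item
    rated = R[:, item] > 0
    others = np.where(rated & (np.arange(len(R)) != user))[0]
    # cosine similarity between the target user and each candidate
    sims = np.array([R[user] @ R[o]
                     / (np.linalg.norm(R[user]) * np.linalg.norm(R[o]))
                     for o in others])
    # step 3: similarity-weighted average over the k nearest neighbors
    top = others[np.argsort(-sims)][:k]
    w = np.sort(sims)[::-1][:k]
    return R[top, item] @ w / w.sum()
```

Items whose predicted rating is highest among the user's unrated items would then be recommended.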

Problems

Data Sparsity

  • When there are very many items, users do not purchase most products
    • Therefore, even if users are actually similar, they cannot be identified as similar unless they have both rated the exact same items (transitive similarity problem)
  • Solutions include the following:
    • Dimensionality reduction via SVD
    • Dimensionality reduction via matrix factorization (see the Model / Function-based / Regression section)
    • Complementation via hybrid collaborative filtering
      • Calculate similarity based on items and users without using user history
    • Data complementation through predictive model construction
      • Predict and fill in evaluation values for unrated items

Scalability

  • When the number of items or users grows very rapidly, computational cost increases
  • Solutions include the following:
    • Before using CF, cluster similar items/users using item-based/user-based approaches to narrow down computation targets (similar to hybrid collaborative filtering)

Gray Sheep

  • Users whose preferences match multiple patterns or none at all, and therefore do not benefit from filtering
  • Solutions include the following:
    • Differentiation through behavior patterns other than collaborative filtering via hybrid CF

Shilling Attack

  • Refers to attacks on the rating system (such as fake reviews)
  • Solutions include the following:
    • Use item-based collaborative filtering
      • Less susceptible than user-based approaches
    • Use hybrid collaborative filtering that combines item-based CF with other approaches

Classification

  • Can be classified as memory-based, model-based, or hybrid

Memory-based CF

  • Makes recommendations based on ratings from similar users, using correlations between users and items

Representative Techniques

  • User-based
    • Analyzes similarity between users based on user behavior history and recommends items with high ratings
  • Item-based
    • Analyzes similarity between items based on user behavior history and recommends highly related items

Advantages

  • Easy to implement
  • Easy to add data
  • No need to consider content
  • No scaling of co-rated items required

Disadvantages

  • Dependence on human ratings

    • Accuracy improves by standardizing ratings
  • Performance degradation with sparse data

  • Cold start

  • Scalability

    • Can be somewhat improved by pre-clustering users and items

    • However, updates at a certain frequency are necessary

Model-based CF

  • Models user purchasing behavior using statistical algorithms and makes recommendations based on the model

Representative Techniques

  • Bayesian network CF
  • Clustering CF
  • Latent semantic CF
  • Factor analysis
  • CF with dimensionality reduction

Advantages

  • Better performance on sparse data
  • Improved prediction performance
  • Intuitive explanations

Disadvantages

  • Complex model construction
  • Scalability (faster than memory-based)
    • Can update nightly or weekly/monthly
  • Adaptability
    • Weak to changes in users and items
      • Cannot adapt without model updates
  • Loss of useful information due to dimensionality reduction

Hybrid CF

  • Uses both (non-CF) content-based and CF approaches for recommendations

Representative Techniques

  • Content-based CF
  • Memory-based + model-based CF

Advantages

  • Can overcome individual problems
  • Improved prediction performance
  • Addresses sparsity and gray sheep issues

Disadvantages

  • Complex construction
  • May require external data (information other than user behavior information) in some cases

Data Used

Explicit Data

  • Users explicitly provide ratings
  • Good/Bad, n-level ratings
  • Small data volume but accurate, with clear distinction between unrated and disliked

Implicit Data

  • Automatically collected information such as browsing data and purchase data
  • Large data volume but inaccurate, with unclear distinction between unrated and disliked

Techniques

SVD

Definition

  • For any m x n real matrix A (rank = k), there exist an m x m orthogonal matrix U, an n x n orthogonal matrix V, and an m x n matrix S whose off-diagonal elements are 0 and whose diagonal elements are positive for the first k columns and 0 thereafter, such that A = U S V^T

    • In the compact form below, the dropped components (for U, the columns from r+1 to m) multiply only zero singular values, so nothing is lost
  • The singular value decomposition of A is expressed as:

\begin{aligned} A &= U \Gamma V^T \\ A &: m \times n,\ U: m \times r,\ \Gamma: r \times r,\ V^T: r \times n \end{aligned}
  • U, V are orthogonal matrices, \Gamma is a diagonal matrix, and the diagonal elements of \Gamma are the singular values of A (> 0)
  • Dimensionality is reduced by truncating \Gamma entries whose diagonal elements are below a certain threshold

Properties

  • The following matrix A_j is the rank-j least squares approximation of A
A_j = \sum^j_{i=1}\lambda_i u_i v_i^T
  • \Gamma is uniquely determined for A, but U and V are not necessarily uniquely determined
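As a concrete sketch of truncation with NumPy (the rating matrix `R` and the kept rank `k` are assumptions for illustration):

```python
import numpy as np

# hypothetical user x item rating matrix (rows = users)
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])

# s holds the singular values in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                        # keep the two largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k least squares approximation
```

`R_k` smooths the original ratings, which is what makes the reduced representation useful for finding similar users despite sparsity.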

SVD++

  • Positioned as an evolution of MF for dimensionality reduction; can be viewed as SVD with regularization
  • See Memory-based / Factorization Machines / SVD++

Comparison with Other Dimensionality Reduction Techniques

  • NMF
    • A method with the constraint that coefficients must be non-negative
    • Not suitable because post-decomposition coefficients become sparse
  • PCA
    • Can take values below 0
    • However, in practice, browsing, ratings, and behavior are non-negative and never go below 0
  • LDA (linear discriminant analysis)
    • The assumption is not met because the data does not follow a multivariate normal distribution
  • ICA
    • The assumption is not met because ratings/behavior history are not statistically independent of each other

Memory-based

Collaborative Filtering to Recommendation Flow

  1. Calculate similarity scores between users/items
    1. Define what constitutes similarity
    2. Prepare data according to the definition
    3. Select and implement a calculation method according to the definition
    4. Actually calculate similarity scores
  2. Make recommendations weighted by similarity scores
    1. Weight by similarity scores (increase the influence of similar users/items)
    2. Normalize
    3. Sort appropriately and display results in order of highest score

Examples

User-based

|  | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| User A | 1 | 0 | x | 1 | 0 | x | 1 | 0 | 1.0 |
| User B | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0.5 |
| User C | 0 | 0 | 1 | 0 | 1 | 0 | 0 | x | 0.25 |
| User D | 0 | 1 | 0 | x | 1 | 0 | 1 | 0 | 0.0 |
| Target user | 1 | 0 | x | 1 | 0 | x | x | x | - |
| Score (n=1) | - | - | x | - | - | x | 1 | 0 | ref. A |
| Score (n=2) | - | - | 1 | - | - | 0 | 0.5 | 0.5 | ref. A & B |
  • Similarity is simplified using match rate (proportion of matching values)

  • Evaluation is simplified to 1 (purchased or not, liked or not)

  • k-NN (n=1): Recommend Item 7, which is highly rated by the most similar user (User A)

  • k-NN (n=2): Recommend Item 3, which has the highest average rating among the similar Users A and B
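The similarity and score rows of the table can be reproduced directly. This is a toy sketch of the simplified match-rate scheme, with `None` playing the role of the table's x:

```python
x = None  # x = unrated
ratings = {
    "A": [1, 0, x, 1, 0, x, 1, 0],
    "B": [0, 0, 1, 1, 1, 0, 0, 1],
    "C": [0, 0, 1, 0, 1, 0, 0, x],
    "D": [0, 1, 0, x, 1, 0, 1, 0],
}
target = [1, 0, x, 1, 0, x, x, x]

def match_rate(u, v):
    """Proportion of matching values over co-rated items."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

def scores(target, ratings, n):
    """Average the n most similar users' ratings for the target's unrated items."""
    top = sorted(ratings, key=lambda u: match_rate(ratings[u], target),
                 reverse=True)[:n]
    out = {}
    for i, r in enumerate(target):
        if r is None:
            vals = [ratings[u][i] for u in top if ratings[u][i] is not None]
            if vals:
                out[i] = sum(vals) / len(vals)
    return out

sims = {u: match_rate(ratings[u], target) for u in ratings}
# sims reproduces the Similarity column: A 1.0, B 0.5, C 0.25, D 0.0
```

`scores(target, ratings, 1)` reproduces the n=1 row (Item 7 scores 1) and `scores(target, ratings, 2)` the n=2 row (Item 3 scores 1).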

Common Improvements

  • There are many improvements in computing similarity between users and calculating which items to recommend from similar users
    • Include whether items were viewed in the similarity calculation
    • Weight by similarity when calculating items to recommend
    • Weight by the number of raters, etc.
  • For sparse data, apply dimensionality reduction
    • When using PCA, fill missing values with default voting values (see below) or the EM algorithm
  • To reduce dependence on rating values, convert to rankings and use ranking-based similarity metrics

User-based Specific Improvements

  • When there are few common raters per user, fill unrated items with the overall rating for that item from all users (default voting value)
  • Give more weight to extreme ratings (minimum and maximum)

Item-based

|  | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| User A | 1 | 0 | x | 1 | 0 | x | 1 | 0 |
| User B | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| User C | 0 | 0 | 1 | 0 | 1 | 0 | 0 | x |
| User D | 0 | 1 | 0 | x | 1 | 0 | 1 | 0 |
| Target user | 1 | 0 | x | 1 | 0 | x | x | x |
| Similarity with Item 1 | - | - | 0.33 | - | - | 1.00 | 0.75 | 0.33 |
| Similarity with Item 4 | - | - | 0.5 | - | - | 0.5 | 0.67 | 0.5 |
| Similarity with highly rated items | - | - | 0.42 | - | - | 0.75 | 0.71 | 0.42 |
  • Simplified using the same calculation method as user-based

  • Improvement methods are also similar to user-based

  • Using Items 1 and 4, which the target user rated highly, calculate the similarity of unrated items (the last row is the average of the two similarity rows)

  • Recommend Item 6, which has the highest similarity
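The similarity rows can be reproduced by applying the same match rate to item columns instead of user rows. A toy sketch, with `None` playing the role of x and only the needed columns transcribed:

```python
x = None  # x = unrated
# Columns of the table above (rows A, B, C, D, target).
cols = {
    1: [1, 0, 0, 0, 1], 3: [x, 1, 1, 0, x], 4: [1, 1, 0, x, 1],
    6: [x, 0, 0, 0, x], 7: [1, 0, 0, 1, x], 8: [0, 1, x, 0, x],
}

def match_rate(u, v):
    """Proportion of matching values over users who rated both items."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

sim1 = {i: round(match_rate(cols[1], cols[i]), 2) for i in (3, 6, 7, 8)}
sim4 = {i: round(match_rate(cols[4], cols[i]), 2) for i in (3, 6, 7, 8)}
# sim1 and sim4 reproduce the "Similarity with Item 1/4" rows;
# averaging them per item gives the final row, so Item 6 wins.
```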

Item-based Specific Improvements

  • Temporary personalization can be achieved by recommending items similar to those in co-occurrence, browsing, shopping, and recent usage history

  • Advantageous over user-based in terms of privacy protection

Problems

  • In situations where recommended items change frequently (such as movies and concerts), frequent neighbor updates are required
  • Experimentally tends to be biased toward specific items (though there is no theoretical basis)

Differentiating User-based and Item-based

  • User-based
    • Fits in memory with frequent changes
    • Each user's preferences have unique values
      • Sites that recommend specific things
    • High serendipity (discovering unexpected items)
  • Item-based
    • Targets large data
    • When there are many items, it is difficult to compute similarity between users
      • Few people buy the same items, or the matrix becomes sparse
    • Item-to-item similarity can be pre-computed, making production recommendations easier
    • Item-to-item similarity changes less frequently, requiring fewer recalculations
    • Easy to explain

Factorization Machines

  • A general-purpose model presented at ICDM in 2010

  • Efficiently computes feature interactions

    • When there are features representing age and gender, the AND condition feature of age + gender is represented by the interaction between age and gender
    • The interaction weight is learned as the inner product of vectors corresponding to each feature
    • Higher-order FMs have been proposed, but as of 2018, accuracy improvements only go up to around 3rd order
  • Somewhat different in concept from CF

| user A | user B | user C | Item a | Item b |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
  • Each rating is represented as one row, and the user-item interaction feature vector is extracted
  • Interactions can freely include time, context, etc.
    • For example, adding Age allows the model to handle User x Item x Age interactions

Prediction Formula

  • 2nd-order FM
\begin{aligned} \hat y(x) &:= w_0 + \sum^n_{i=1}w_ix_i+\sum^n_{i=1}\sum^n_{j=i+1}\langle v_i,v_j\rangle x_ix_j \\ \text{where} \ \ w_0 &\in \mathbb{R},\ \ w \in \mathbb{R}^n,\ \ V \in \mathbb{R}^{n\times k} \\ \langle v_i,v_j\rangle &:= \sum^k_{f=1}v_{i,f} v_{j,f} \end{aligned}

Row v_i in V is the i-th variable with k factors

k \in \mathbb{N}^+_0 is a hyperparameter defining the dimensionality of the factorization

  • Reference: Standard 2nd-order linear model with interactions
    • When directly applied to recommendation data, x_ix_j is sparse, making it difficult to learn w_{i,j}
    • Therefore, that part is learned as an inner product of vectors
      • This also reduces the number of learning parameters
\hat y(x) := w_0 + \sum^n_{i=1}w_ix_i+\sum^n_{i=1}\sum^n_{j=i+1}w_{i,j}x_ix_j
  • 2-way FM captures single-variable effects plus pairwise interactions between variables
  • w_0 is the global bias
  • w_i models the strength of the i-th variable
  • w_{i,j} := \langle v_i, v_j \rangle models the interaction between the i-th and j-th variables
  • Instead of using a separate parameter w_{i,j} for each interaction, the FM model factorizes the interactions
    • This is important for sparse vectors
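The pairwise term above can be evaluated in O(kn) rather than O(kn^2) using the identity \sum_{i<j}\langle v_i,v_j\rangle x_ix_j = \frac{1}{2}\sum_f[(\sum_i v_{i,f}x_i)^2 - \sum_i v_{i,f}^2x_i^2]. A sketch with illustrative names, checked against the naive double loop:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2nd-order FM prediction in O(kn):
    sum_{i<j} <v_i,v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]"""
    s = V.T @ x                        # (k,) per-factor weighted sums
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise
```

This linear-time form is what makes FM practical on the sparse one-hot rows shown in the table above.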

Prediction

  • Regression
    • Use \hat y(x) directly as the prediction and train with MSE, etc.
  • Binary classification
    • Use \hat y(x) as the score and train with logistic or hinge loss
  • Ranking
    • Rank vectors x by \hat y(x) and train with a pairwise classification loss over pairs (x^{(a)}, x^{(b)}) \in D

Optimization Methods

  • SGD, ALS, MCMC, etc.
  • Can train with various losses including MSE, logistic loss, hinge loss
  • Use L2 regularization to prevent overfitting
    • Early stopping may give better accuracy
  • The gradient of \hat y is:
\frac{\partial}{\partial\theta}\hat y(x)= \begin{cases} 1, & \text{if}\ \theta\ \text{is}\ w_0 \\ x_i, & \text{if}\ \theta\ \text{is}\ w_i \\ x_i\sum^n_{j=1}v_{j,f}x_j- v_{i,f}x^2_i, & \text{if}\ \theta\ \text{is}\ v_{i,f} \end{cases}
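These gradients can be checked numerically against the prediction. A sketch with illustrative names, using the O(kn) prediction form:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2nd-order FM prediction via the O(kn) reformulation."""
    s = V.T @ x
    return w0 + w @ x + 0.5 * float(np.sum(s ** 2 - (V ** 2).T @ (x ** 2)))

def fm_grads(x, V):
    """Gradients of y_hat w.r.t. w0, w, and V for the 2nd-order FM."""
    s = V.T @ x                                   # sum_j v_{j,f} x_j
    g_w0 = 1.0                                    # d y / d w0 = 1
    g_w = x.copy()                                # d y / d w_i = x_i
    g_V = np.outer(x, s) - V * (x ** 2)[:, None]  # x_i s_f - v_{i,f} x_i^2
    return g_w0, g_w, g_V
```

Note that `s` can be precomputed once per example and reused for every `v_{i,f}`, which is what makes SGD on FMs cheap.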

SVD++

Original MF (SVD)

\hat r_{ui} = \mu + b_{user}(u)+b_{item}(i)+U_uI_i

where:

  • \hat r_{ui}: predicted rating
  • \mu: global mean
  • b_{user}: individual (user) bias
  • b_{item}: item bias
  • U: user attributes (U \in \mathbb{R}^{u \times k})
  • I: item attributes (I \in \mathbb{R}^{k \times m})

loss = \min_{b,U,I}\sum_{u,i\in R}(r_{ui}-\hat r_{ui})^2 + \lambda(|b_{user}|^2+|b_{item}|^2+|U^T_u|^2+|I_i|^2)

SVD++

\hat{r}_{ui} = \mu + b_{user}(u)+b_{item}(i)+\left(U_u+|R(u)|^{-0.5}\sum_{j \in R(u)}y_j\right)I_i

where R(u) is the set of items rated by user u and y_j is the effect of a user action on item j.

The loss function is:

\text{loss} = \text{argmin}\left[\sum(r_{ui}-\hat{r}_{ui})^2+\lambda_1(||b||^2+||U||^2)+\lambda_2(||I||^2+||y||^2)\right]
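Training the biased-MF form above with SGD can be sketched as follows. This omits the implicit-feedback term y_j of SVD++ for brevity; the learning rate, regularization strength, and data layout are all assumptions for illustration:

```python
import numpy as np

def sgd_epoch(ratings, mu, b_u, b_i, U, I, lr=0.01, lam=0.05):
    """One SGD pass over ratings = iterable of (user, item, r);
    updates parameters in place following
    r_hat = mu + b_u[u] + b_i[i] + U[u] @ I[i]."""
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + U[u] @ I[i])
        b_u[u] += lr * (err - lam * b_u[u])
        b_i[i] += lr * (err - lam * b_i[i])
        # simultaneous update of the two factor vectors
        U[u], I[i] = (U[u] + lr * (err * I[i] - lam * U[u]),
                      I[i] + lr * (err * U[u] - lam * I[i]))
```

Each update only touches one user row and one item row, so a pass is linear in the number of observed ratings regardless of matrix size.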

Field-aware FM (FFM)

Prediction Formula

  • An extension of FM
    • FM learns one embedding vector per feature
    • FFM improves on this by splitting by field (the original categorical variable a feature came from) and learning one embedding vector per (feature, field) pair
  • FM assumed one-hot encoded categorical variables as features, so different attributes were uniformly converted to values of 1
    • The information about which original categorical variable each feature belonged to was lost
    • FFM can express field-specific interaction factors
\hat{y}(x) := w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle v_{if_j}, v_{jf_i} \rangle x_i x_j

where f_i and f_j denote the fields that features i and j belong to, respectively.
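A direct sketch of the formula, assuming `V` stores one k-dimensional embedding per (feature, field) pair; all names here are illustrative:

```python
import numpy as np

def ffm_predict(x, fields, w0, w, V):
    """FFM prediction. V[i, f] is the embedding feature i uses when
    interacting with a feature from field f; V is assumed to have
    shape (n_features, n_fields, k), and fields[i] gives i's field."""
    y = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            # each side picks the embedding matched to the other's field
            y += (V[i, fields[j]] @ V[j, fields[i]]) * x[i] * x[j]
    return y
```

With a single field this collapses to plain FM, which gives a convenient sanity check.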

Training Method

  • Same as FM
  • However, due to the increased number of parameters, it is more prone to overfitting, so early stopping is used

Deep FM Variants

Factorization-machine supported Neural Network (FNN)

  • ECIR2016

  • Standard MLP

    • Categorical input data is one-hot encoded per variable, then embedded and fed into a dense network
    • FM weights and vectors are used to initialize layer weights

Deep Crossing Network (DCN)

  • KDD2016
  • Replaces the MLP part of FNN with ResNet-like skip connections to predict residuals
  • Since experiments were on internal data, it is not widely cited, but later research shows higher accuracy than FNN

Wide & Deep (arXiv:1606.07792)

  • 2016
  • Uses embedded categorical values, same as FNN and DCN
  • Attaches a linear model at the output layer
  • Better accuracy than FNN and DCN, plus a TensorFlow API exists

Product-based Neural Network (PNN)

  • ICDM2017

  • Embedding categorical variables makes it difficult to handle explicit interactions between categorical variables

  • Incorporates categorical variable interaction terms into the model architecture

  • Creates a layer for interactions between categorical variables

  • PNN is more accurate than FNN with stacked embeddings

    • (Possibly weaker than Wide & Deep)

DeepFM

  • IJCAI2017

  • Evolution of Wide & Deep?

  • Feed sparse features into an embedding layer

    • Branch 1: Feed into Factorization Machine
    • Branch 2: Feed into a dense network
  • Combine Branch 1 and 2 for output

  • More accurate than FM & Deep (version without embedding the FM input)

Deep & Cross Network (CrossNet)

  • ADKDD2017
  1. Use continuous values as-is, embed categorical variables as features
  2. Branch as follows:
    1. Branch 1: Feed into a Cross Network that considers interaction terms
    2. Branch 2: Feed into a dense network
  3. Combine Branches 1 and 2, apply FC, then output with Sigmoid

xDeepFM

  • KDD2018
  • Learns higher-order features explicitly at the vector level using an architecture called Compressed Interaction Network (CIN)
  1. Branch:
    1. Linear layer
    2. Embedding
  2. Branch from 1-2 Embedding:
    1. CIN
    2. Plain DNN
  3. Combine 1-1, 2-1, 2-2 for output

FFM Extensions

Field-aware Probabilistic Embedding Neural Network (FPENN)

  • RecSys2018
  • Prevents overfitting + introduces NN
  • Same approach of learning Field-number of embeddings
    • However, estimates the parameters of the probability density that the embedding follows
    • This enables robust learning against noise in the training data
  • Combination with Deep follows DeepFM (?)

Field-weighted Factorization Machine (FwFM)

  • WWW2018
  • The importance for prediction varies for each pair of fields
    • Therefore, the importance of each field pair is also formalized

Interaction-aware Factorization Machines (IFM)

  • AAAI 2019
  • Attention + FM
  • Considers attention at both feature level and field level
  • Since attention is not well understood, this model and the one below are not fully grasped

Field Attention Deep Field-aware Factorization Machine (FAT-DeepFFM)

  • ICDM 2019

  • Attention + FFM

  • Considers attention at the embedding stage

Model-based

Cluster Model

Method

  1. Create clusters of users with similar preference patterns
    • Uses item rating values, etc.
  2. Find the most similar cluster for the user
  3. Recommend items in order of highest average rating within the cluster
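A minimal sketch of these steps with a hand-rolled k-means (all data and names are illustrative; 0 means unrated, and initialization simply uses the first k rows for determinism):

```python
import numpy as np

def kmeans(R, k, iters=20):
    """Step 1: cluster users by their rating vectors."""
    centers = R[:k].astype(float).copy()   # simplistic deterministic init
    labels = np.zeros(len(R), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((R[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = R[labels == c].mean(axis=0)
    return labels

def recommend(R, user, k=2):
    labels = kmeans(R, k)
    cluster = labels == labels[user]         # step 2: the user's own cluster
    mean_ratings = R[cluster].mean(axis=0)   # step 3: average rating per item
    scores = np.where(R[user] == 0, mean_ratings, -np.inf)  # unrated only
    return int(np.argmax(scores))
```

The choice of `k` here directly controls the trade-off described under Properties below: fewer clusters give broader, less personalized recommendations.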

Properties

  • The nature of recommendations varies greatly depending on the number of clusters
    • The number of clusters needs to be adjusted according to the objective
  • The types of recommendations are limited by the number of clusters
    • With few clusters, recommendations are broad and not very personalized
      • On the other hand, more robust against the cold start problem
    • With many clusters, it is difficult to find stable clusters
      • On the other hand, the degree of personalization increases
    • Cannot address the gray sheep problem
      • Since users are assigned to specific clusters, recommendations spanning multiple clusters are not possible
  • Easy to implement and fast execution speed

Function-based

  • General models using utility functions that take larger values the more a user likes something, or functions that predict rating values
  • Reduces to regression, classification, ranking regression, etc.

Regression

About the Model

  • Decompose the rating matrix R (without missing values) as follows:

  • U is k x n dimensional, V is k x m dimensional (k \ll m, n)

R^* \approx U^T V
  • Column x of U represents features of user x, column y of V represents features of item y
  • Therefore, user x's rating for item y is:
r^*_{xy} \approx u^T_x v_y
  • This can be viewed as a similar model to user-based collaborative filtering where u_x = 1/|v|, v_y = item ratings

Training Method

  • Learn U and V by decomposition to minimize the expected loss between R and R*
  • Use matrix factorization techniques
    • Assume the residual R-R* follows a normal distribution, and compute U and V to maximize the generation probability of observed ratings (factor analysis)
  • Minimize the objective function
    • lambda is a parameter to prevent overfitting
    • Train with SGD or ALS (fix other values and update parameters sequentially)

\text{loss} = \min_{p,q}\sum\bigl(r_{ui}-q^T_ip_u\bigr)^2+\lambda \left(||q_i||^2+||p_u||^2\right)

  • For missing values, the following approaches are possible:
    • Treat as latent variables and solve using the EM algorithm
    • Compute only on data without missing values
    • Fill with mean values, etc.
    • Ignore missing values when computing the loss

Others

  • Simple enough to extend with privacy-preserving filtering and nonlinear regression for prediction accuracy improvement

  • Representing the entire system with a single linear model may be too coarse, resulting in low prediction accuracy

    • Can be addressed with the Horting method, which constructs local linear models among users with similar preferences

Classification

  • Ignore the magnitude relationship of rating values and treat them as classes
    • Modifications to consider magnitude relationships are also possible
  • Features can include ratings from other users for the target item and ratings from the target user for other items

Sequential Binary Relation Learning

  • Acquire a function representing the ease of classification for each class
  • Classify the target into the class where the function value is maximum

\hat{r}_{xy} = \operatorname{argmax}_{r \in R}\left( \sum_{i} u_{xi}+\sum_{j} w_{yj} \right)

where:

  • u_{xi} is the user-similarity weight for user i evaluating item y with rating r

  • w_{yj} is the item-similarity weight for item j evaluated by user x with rating r

  • u, w are obtained within an online learning framework

    • SGD with non-converging learning rate (Momentum), etc.

    1. First initialize u, w
    2. Each time r is observed, for all users i other than user x who have rated item y:
      • Increase the weight u if the rating is the same, decrease it otherwise
    3. Also, for all items j other than y that user x has rated:
      • Increase the weight w if the rating is the same, decrease it otherwise

Others

  • Being online learning, it can handle addition of items and users despite being model-based

  • Cannot handle deletion of items or users

Ranking

  • There is research on recommending with RankBoost

k-NN

  • Uses a model but is not model-based?

  • Used in the context of CF

  • Makes recommendations from the ratings of K nearest items/users

    • May also take a weighted sum favoring closer users

Rule-based

  • Could be called a model, but refers to expert models?

Bayesian Classifier (Probabilistic Model)

History-conditioned Type

  • The conditional expected value of the target user's rating for item y, given the target user's past preference data
    • The probability distribution is unknown, so an estimated function is used
    • Other sample users' preference data is not referenced in the formula, but it is used in the estimation of the probability distribution, so this qualifies as collaborative filtering-based recommendation

\hat{r}_{ay} = E[r_{ay} \mid r_{aj}, j \in Y_a] = \sum_{r \in R} r\,P(r_{ay}=r \mid r_{aj}, j \in Y_a)

where:

  • r_{ay} is the rating given by user a to item y

  • Y_a is the set of items rated by user a

  • For preference data with R = 1, this becomes a conditional probability distribution

    • Training data is required, but with implicit ratings, noise where r_ay=0 becomes large, which is disadvantageous
    • Also, this form requires a model for each individual user, which is impractical
\hat r_{ay} = \Pr \bigl[ r_{ay}=1 \mid s_{aj} ~\text{s.t.}~ j \in Y_a \bigr]
  • In practice, the joint distribution of ratings for all items, independent of individual users, is computed

P(r_{\cdot y} \mid y \in Y)

where r_{\cdot y} is the rating distribution of all users for item y

  • The conditional distribution in the first formula is computed using Bayes' rule and marginalization to eliminate unnecessary variables
    • However, with |Y| = m, the number of parameter combinations |R_1| \times |R_2| \times \dots \times |R_m| grows exponentially with m
    • Therefore, partial conditional independence between variables r.y is assumed to reduce the parameters to learn
    • Probabilistic models with conditional independence are called graphical models because they represent dependencies as graphs
      • Bayesian networks are one such model

Bayesian Networks

  • Bayesian networks are pre-built from past user behavior data

  • In production, items to recommend are determined when events occur

    • Feed viewed pages to the Bayesian network for inference, computing viewing probabilities
  • Some research uses them to create user segments

    • In this case, the probability of belonging to each segment is obtained
    • Clustering can also provide distances from each cluster center
    • Bayesian networks do not require data for all nodes
  • Maximum Weighted Spanning Tree (MWST) Algorithm

    • Suitable for large-scale data with good prediction performance, computational complexity is O(n^2) with respect to the number of nodes
    • Constructs a spanning tree with the highest conditional probability
    • Covered separately below
  • Pearl's Message Passing Algorithm

    • A method for quickly computing marginal posterior probabilities of all nodes in the network
  • User-item Bayesian

    • Compares metadata of items rated by the user with metadata of other items
    • Computes the probability of selecting other items
  • User-user Bayesian

    • Compares metadata of items liked by users
    • Computes similarity of preferences between users
  • Item-item Bayesian

    • Compares item metadata
    • Computes similarity between items

Naive Bayes

Co-occurrence Type

  • The fact that a user rated an item is viewed as the co-occurrence of the event "the user is x" and the event "the item is y"
  • This co-occurrence probability is modeled
  • Uses Probabilistic Latent Semantic Analysis (pLSA)
    • Very flexible, allowing extensions to incorporate factors such as rating variation, item information, context information, and personal attribute information
  • Extending to the Bayesian framework to handle new users yields Latent Dirichlet Allocation (LDA)
  • Details at http://www.kamishima.net/archive/recsysdoc.pdf P70

MWST Algorithm

  • A method for constructing a tree based on mutual information between variables
  • Consists of the following steps:
  1. From the given data, compute the mutual information I(x_i; x_j) for all N(N-1)/2 edges
  2. Extract the edge with the highest mutual information and make it part of the tree
  3. Add the edge with the next highest mutual information to the tree
    • However, discard the edge if it creates a loop
  4. Repeat step 3 until N-1 edges are added to the tree
  5. Finally, determine the root node and direct edges from the root toward the leaves
    • In a paper recommending web pages (100,000+ pages), the root node was set as the page viewed by the most users
    • Building a Bayesian network for all site web pages was computationally infeasible, with a limit of about 5,000 pages
    • Site pages were divided into several dozen categories, Bayesian networks were built with category variables, and category inference was performed for users
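Steps 1-4 above can be sketched as a Kruskal-style greedy over mutual information. This is an illustration only; the plug-in MI estimator for binary variables and all names are assumptions:

```python
import itertools
from math import log

def mutual_info(data, i, j):
    """Plug-in estimate of I(x_i; x_j) for binary variables,
    data = list of 0/1 observation rows."""
    n = len(data)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for row in data if row[i] == a and row[j] == b) / n
            p_a = sum(1 for row in data if row[i] == a) / n
            p_b = sum(1 for row in data if row[j] == b) / n
            if p_ab > 0:
                mi += p_ab * log(p_ab / (p_a * p_b))
    return mi

def mwst(data, n_vars):
    # step 1: MI for all N(N-1)/2 edges, sorted from highest to lowest
    edges = sorted(((mutual_info(data, i, j), i, j)
                    for i, j in itertools.combinations(range(n_vars), 2)),
                   reverse=True)
    parent = list(range(n_vars))        # union-find to detect loops
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for _, i, j in edges:               # steps 2-4: greedy edge insertion
        ri, rj = find(i), find(j)
        if ri != rj:                    # discard edges that create a loop
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == n_vars - 1:
            break
    return tree
```

Step 5 (rooting and directing the tree) is then a single traversal from the chosen root.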

Advantages

  • Only uses up to second-order statistics, so computation from data is easy and reliable
  • Computational complexity is at most O(n^2) with respect to the number of nodes n
  • Not statistically consistent, but prediction efficiency is good

Pearl's Message Passing Algorithm

  • A method for fast probabilistic inference
  • Generally, in Bayesian networks, after constructing the network or when data is given to nodes, marginalization is needed to compute the posterior probability of a given node
  • This algorithm efficiently computes this marginalization using the network structure
  1. Send messages from the node that received data to its neighboring nodes
  2. Nodes that receive a message update their marginal posterior probabilities using the received message
  3. Updated nodes send messages to their neighboring nodes (except the sender)
  4. Repeat 2 and 3 to update all nodes' marginal posterior probabilities

At Inference Time

  • Feed the pages viewed by the user as data into the Bayesian network for inference, computing item preference probabilities for that user
    • Use Pearl's Message Passing algorithm for speed
    • Arrange items in descending order of computed preference probability

Time Series Models

Maximum Entropy Model

P(y^{(t+1)} \mid Y^{(t)}) = \frac{1}{Z}\exp\left[\sum_{k=1}^K \lambda_k f_k(y^{(t+1)}, Y^{(t)})\right]

where:

  • Z is the normalization constant

  • \lambda_k are weight parameters

  • f_k are feature functions

  • A feature function returns 1 when the purchase history Y^{(t)} has a certain characteristic and the next purchased item is a specific one, and 0 otherwise

  • The probability of purchasing item y' next given that item y is in the purchase history:

P(y^{(t+1)}=y' \mid y \in Y^{(t)}) - P(y^{(t+1)}=y')

We select items where this difference is largest.

  • By selecting feature functions this way, the model parameters \lambda_k can be estimated based on the maximum entropy principle
  • Various feature functions can be used

Markov Process Model

  • Predicts using the most recent K items in the sequence
  • Uses reinforcement learning, etc.

Subscription/Repeat Purchase

  • Uses techniques such as survival analysis
  • Cox proportional hazards model

Content-based Filtering

Data Format

  • There are preference data and search queries
  • For preference data, items are predicted using user attributes and context attributes as features with machine learning algorithms
