Recommendation Models
A summary of recent trends in recommendation models, also known as recommender systems.
- Various descriptions exist, and the scope of terminology may differ depending on the author
- \langle w, x \rangle denotes the inner product of w and x
Classification
- Recommendation systems are broadly divided into 3 types:
- Rule-based: Does not use other users' information at inference time
- Memory-based: Uses other users' information at both training and inference time
- Model-based: Uses other users' information at training time but not at inference time
| | Rule-based | Memory-based | Model-based |
|---|---|---|---|
| Degree of personalization | Low | High | Depends on the model |
| Method | Recommend according to rules | Compute from overall item/user information for each recommendation | Input target user's information into the model for each recommendation |
| Techniques | - | Collaborative filtering | Cluster models, function-based, Bayesian |
- Other classifications:
- No personalization: Same recommendation for all users
- Temporary personalization: Recommendations based only on session data (temporary data)
- Persistent personalization: Recommendations based on long-term data (history, etc.)
- Non-updating personalization: Inference using basic information registered by the user that is generally not updated, or data valid only for the current session
| - | No personalization | Temporary personalization | Persistent personalization | Non-updating personalization |
|---|---|---|---|---|
| Item | Spec ranking (by price) | Similar/complementary items in the cart | Similar trending items from history | Non-interactive chatbot / salesperson suggestions |
| People | Popularity ranking | Frequently bought together | Collaborative filtering (similar trending items from ratings/behavior) | Surveys, etc. |
| Knowledge | Store/season-specific recommendations | Checkout recommendations | Rare / expert recommendations for individuals | Chatbot / salesperson suggestions |
Collaborative Filtering
- A technique that uses user rating (behavior) information to recommend items based on information from other users similar to the target user
Advantages
- No domain knowledge required
- Can recommend across different genres (serendipity)
Disadvantages
- Cold start problem
  - Cannot recommend for new users or items with few purchases (actions) at the start of the service
- Low coverage
  - Recommended items tend to be those that are frequently purchased
  - Items that have never been purchased cannot be recommended
- Cannot recommend similar items
  - Item similarity is unknown to the system
Procedure
- Accumulate user rating (behavior) information
- If not accumulated, use rule-based or content-based systems
- Find users with similar preferences from rating (behavior) information
- See the similarity metrics survey for how to calculate similarity
- When there are many users, pre-clustering users by behavior history can save computational resources
- Recommend items that similar users like and that are predicted to have high ratings for the target user
- k-NN and similar methods are used here
Problems
Data Sparsity
- When there are very many items, users do not purchase most products
- Therefore, even if users are actually similar, they cannot be identified as similar unless they have both rated the exact same items (transitive similarity problem)
- Solutions include the following:
- Dimensionality reduction via SVD
- Dimensionality reduction via matrix factorization (see the Model / Function-based / Regression section)
- Complementation via hybrid collaborative filtering
- Calculate similarity based on items and users without using user history
- Data complementation through predictive model construction
- Predict and fill in evaluation values for unrated items
Scalability
- When the number of items or users grows very rapidly, computational cost increases
- Solutions include the following:
- Before using CF, cluster similar items/users using item-based/user-based approaches to narrow down computation targets (similar to hybrid collaborative filtering)
Gray Sheep
- Users whose preferences match multiple patterns or none at all, and therefore do not benefit from filtering
- Solutions include the following:
- Differentiation through behavior patterns other than collaborative filtering via hybrid CF
Shilling Attack
- Refers to attacks on the rating system (such as fake reviews)
- Solutions include the following:
- Use item-based collaborative filtering
- Less susceptible than user-based approaches
- Use a hybrid combining item-based CF with other approaches
Classification
- Can be classified as memory-based, model-based, or hybrid
Memory-based CF
- Makes recommendations based on ratings from similar users, using correlations between users and items
Representative Techniques
- User-based
- Analyzes similarity between users based on user behavior history and recommends items with high ratings
- Item-based
- Analyzes similarity between items based on user behavior history and recommends highly related items
Advantages
- Easy to implement
- Easy to add data
- No need to consider content
- No scaling of co-rated items required
Disadvantages
- Dependence on human ratings
  - Accuracy improves by standardizing ratings
- Performance degradation with sparse data
- Cold start
- Scalability
  - Can be somewhat improved by pre-clustering users and items
  - However, updates at a certain frequency are still necessary
Model-based CF
- Models user purchasing behavior using statistical algorithms and makes recommendations based on the model
Representative Techniques
- Bayesian network CF
- Clustering CF
- Latent semantic CF
- Factor analysis
- CF with dimensionality reduction
Advantages
- Performance with sparse data
- Improved prediction performance
- Intuitive explanations
Disadvantages
- Complex model construction
- Scalability (faster than memory-based)
- Can update nightly or weekly/monthly
- Adaptability
- Weak to changes in users and items
- Cannot adapt without model updates
- Loss of useful information due to dimensionality reduction
Hybrid CF
- Uses both (non-CF) content-based and CF approaches for recommendations
Representative Techniques
- Content-based CF
- Memory-based + model-based CF
Advantages
- Can overcome individual problems
- Improved prediction performance
- Addresses sparsity and gray sheep issues
Disadvantages
- Complex construction
- May require external data (information other than user behavior information) in some cases
Data Used
Explicit Data
- Users explicitly provide ratings
- Good/Bad, n-level ratings
- Small data volume but accurate, with clear distinction between unrated and disliked
Implicit Data
- Automatically collected information such as browsing data and purchase data
- Large data volume but inaccurate, with unclear distinction between unrated and disliked
Techniques
SVD
Definition
- For any m x n real matrix A (rank k), there exist an m x m orthogonal matrix U, an n x n orthogonal matrix V, and an m x n matrix S whose off-diagonal elements are 0 and whose i-th diagonal element is greater than 0 for i <= k and equal to 0 for i > k, such that A = U S V^T
  - Since the singular values beyond rank k are 0, the corresponding rows/columns of U and V contribute nothing to the product
- The singular value decomposition of A is expressed as A = U \Gamma V^T
  - U, V are orthogonal matrices, \Gamma is a diagonal matrix, and the diagonal elements of \Gamma are the singular values of A (> 0)
  - Dimensionality is reduced by truncating the entries of \Gamma whose diagonal elements are below a certain threshold
Properties
- The truncated matrix A_j = U_j \Gamma_j V_j^T, keeping only the j largest singular values, is the rank-j least-squares approximation of A (Eckart-Young theorem)
- \Gamma is uniquely determined for A, but U and V are not necessarily uniquely determined
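The decomposition and the rank-j least-squares property can be checked with NumPy (the matrix below is toy data):

```python
import numpy as np

A = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])
U, s, Vt = np.linalg.svd(A)    # A = U @ diag(s) @ Vt, s sorted descending
j = 2
A_j = U[:, :j] @ np.diag(s[:j]) @ Vt[:j]   # keep the j largest singular values
err = np.linalg.norm(A - A_j)              # Frobenius norm of the residual
# Eckart-Young: the residual norm equals the energy of the discarded singular values
```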
SVD ++
- Positioned as an evolution of MF as dimensionality reduction, viewed as SVD with regularization
- See Memory-based / Factorization Machines / SVD++
Comparison with Other Dimensionality Reduction Techniques
- NMF
- A method with the constraint that coefficients must be non-negative
- Not suitable because post-decomposition coefficients become sparse
- PCA
- Can take values below 0
- However, in practice, browsing, ratings, and behavior are non-negative and never go below 0
- LDA
- The assumption is not met because data does not follow a multivariate normal distribution
- ICA
- The assumption is not met because ratings/behavior history are not statistically independent of each other
Memory-based
Collaborative Filtering to Recommendation Flow
- Calculate similarity scores between users/items
- Define what constitutes similarity
- Prepare data according to the definition
- Select and implement a calculation method according to the definition
- Actually calculate similarity scores
- Make recommendations weighted by similarity scores
- Weight by similarity scores (increase the influence of similar users/items)
- Normalize
- Sort appropriately and display results in order of highest score
Examples
User-based
| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Similarity |
|---|---|---|---|---|---|---|---|---|---|
| User A | 1 | 0 | x | 1 | 0 | x | 1 | 0 | 1.0 |
| User B | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0.5 |
| User C | 0 | 0 | 1 | 0 | 1 | 0 | 0 | x | 0.25 |
| User D | 0 | 1 | 0 | x | 1 | 0 | 1 | 0 | 0.0 |
| Target user | 1 | 0 | x | 1 | 0 | x | x | x | - |
| Score (n=1) | - | - | x | - | - | x | 1 | 0 | ref A |
| n=2 | - | - | 1 | - | - | 0 | 0.5 | 0.5 | ref A&B |
- Similarity is simplified to the match rate (proportion of matching values)
- Evaluation is simplified to binary (purchased or not, liked or not)
- k-NN (n=1): Recommend Item 7, which is highly rated by the most similar User A
- k-NN (n=2): Recommend Item 3, which has the highest average rating from similar Users A and B
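The table above can be reproduced with a short sketch (match-rate similarity; `None` stands for the unrated "x" cells):

```python
# Similarity is the match rate over co-rated items; scores average the
# ratings of the n most similar users on items the target has not rated.
ratings = {
    "A": [1, 0, None, 1, 0, None, 1, 0],
    "B": [0, 0, 1, 1, 1, 0, 0, 1],
    "C": [0, 0, 1, 0, 1, 0, 0, None],
    "D": [0, 1, 0, None, 1, 0, 1, 0],
}
target = [1, 0, None, 1, 0, None, None, None]

def match_rate(u, v):
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

sims = {u: match_rate(r, target) for u, r in ratings.items()}
# sims == {"A": 1.0, "B": 0.5, "C": 0.25, "D": 0.0}, as in the table

def scores(n):
    neighbors = sorted(sims, key=sims.get, reverse=True)[:n]
    out = {}
    for item, r in enumerate(target):
        if r is not None:
            continue  # only score items the target has not rated
        rated = [ratings[u][item] for u in neighbors if ratings[u][item] is not None]
        if rated:
            out[item + 1] = sum(rated) / len(rated)
    return out

print(scores(1))  # {7: 1.0, 8: 0.0} -> recommend Item 7
print(scores(2))  # {3: 1.0, 6: 0.0, 7: 0.5, 8: 0.5} -> recommend Item 3
```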
Common Improvements
- There are many improvements in computing similarity between users and calculating which items to recommend from similar users
- Include whether items were viewed in the similarity calculation
- Weight by similarity when calculating items to recommend
- Weight by the number of raters, etc.
- For sparse data, apply dimensionality reduction
- When using PCA, fill missing values with default voting values (see below) or the EM algorithm
- To reduce dependence on rating values, convert to rankings and use ranking-based similarity metrics
User-based Specific Improvements
- When there are few common raters per user, fill unrated items with the overall rating for that item from all users (default voting value)
- Give more weight to extreme ratings (minimum and maximum)
Item-based
| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 |
|---|---|---|---|---|---|---|---|---|
| User A | 1 | 0 | x | 1 | 0 | x | 1 | 0 |
| User B | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| User C | 0 | 0 | 1 | 0 | 1 | 0 | 0 | x |
| User D | 0 | 1 | 0 | x | 1 | 0 | 1 | 0 |
| Target user | 1 | 0 | x | 1 | 0 | x | x | x |
| Similarity with Item 1 | - | - | 0.33 | - | - | 1.00 | 0.75 | 0.33 |
| Similarity with Item 4 | - | - | 0.5 | - | - | 0.5 | 0.67 | 0.5 |
| Similarity with highly rated items | - | - | 0.42 | - | - | 0.75 | 0.71 | 0.42 |
- Simplified using the same calculation method as user-based
- Improvement methods are also similar to user-based
- Using Items 1 and 4, which the target user rated highly, calculate the similarity of each unrated item (the bottom row averages the two similarity rows)
- Recommend Item 6, which has the highest similarity
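The item-based table can be reproduced the same way (`None` marks the unrated "x" cells; the table's similarity rows are for Items 1 and 4, the target user's highly rated items):

```python
# Item-item similarity is the match rate over users who rated both items;
# each unrated item is scored by its average similarity to Items 1 and 4.
cols = list(zip(*[
    [1, 0, None, 1, 0, None, 1, 0],       # User A
    [0, 0, 1, 1, 1, 0, 0, 1],             # User B
    [0, 0, 1, 0, 1, 0, 0, None],          # User C
    [0, 1, 0, None, 1, 0, 1, 0],          # User D
    [1, 0, None, 1, 0, None, None, None]  # Target user
]))  # cols[i] is the column of ratings for item i+1

def match_rate(u, v):
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

liked = [0, 3]          # Items 1 and 4 (0-indexed), rated highly by the target
unrated = [2, 5, 6, 7]  # Items 3, 6, 7, 8

scores = {j + 1: sum(match_rate(cols[i], cols[j]) for i in liked) / len(liked)
          for j in unrated}
best = max(scores, key=scores.get)
print(best)  # 6 -> recommend Item 6
```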
Item-based Specific Improvements
- Recommending items similar to co-occurrence, browsing, shopping, and recent usage history enables temporary personalization
- Advantageous over user-based in terms of privacy protection
Problems
- In situations where recommended items change frequently (such as movies and concerts), frequent neighbor updates are required
- Experimentally tends to be biased toward specific items (though there is no theoretical basis)
Differentiating User-based and Item-based
- User-based
- Fits in memory with frequent changes
- Each user's preferences have unique values
- Sites that recommend specific things
- High serendipity (discovering unexpected items)
- Item-based
- Targets large data
- When there are many items, it is difficult to compute similarity between users
- Few people buy the same items, or the matrix becomes sparse
- Item-to-item similarity can be pre-computed, making production recommendations easier
- Item-to-item similarity changes less frequently, requiring fewer recalculations
- Easy to explain
Factorization Machines
- A general-purpose model presented at ICDM 2010
- Efficiently computes feature interactions
  - When there are features for age and gender, the AND condition of age and gender is represented by the interaction between the two features
  - The interaction weight is learned as the inner product of vectors corresponding to each feature
  - Higher-order FMs have been proposed, but as of 2018, accuracy improvements only go up to around 3rd order
- Somewhat different in concept from CF
| User A | User B | User C | Item a | Item b |
|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
- Each rating is represented as one row, and the user-item interaction feature vector is extracted
- Interactions can freely include time, context, etc.
- For example, adding Age allows the model to handle User x Item x Age interactions
Prediction Formula
- 2nd-order FM:
  - \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
  - Row v_i of V describes the i-th variable with k factors
  - k \in \mathbb{N}_0^+ is a hyperparameter defining the dimensionality of the factorization
- Reference: the standard 2nd-order linear model with interactions uses one weight w_{i,j} per pair
  - When directly applied to recommendation data, x is sparse, so most w_{i,j} cannot be learned
  - Therefore, each interaction weight is learned as an inner product of vectors, w_{i,j} \approx \langle v_i, v_j \rangle
  - This also reduces the number of learning parameters
- A 2-way FM captures single variables plus pairwise interactions between variables
  - w_0 is the global bias
  - w_i models the strength of the i-th variable
  - \langle v_i, v_j \rangle models the interaction between the i-th and j-th variables
  - Instead of using a parameter w_{i,j} for each interaction, the FM factorizes it to model interactions
  - This is important for sparse vectors
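A minimal sketch of the 2nd-order prediction equation, using the O(kn) reformulation from the FM paper (toy data; assumed shapes: x is (n,), V is (n, k)):

```python
import numpy as np

# The pairwise term uses the identity
#   sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
# which makes prediction O(k*n) instead of O(k*n^2).
def fm_predict(x, w0, w, V):
    s = V.T @ x                  # (k,)  sum_i v_if * x_i
    s2 = (V ** 2).T @ (x ** 2)   # (k,)  sum_i v_if^2 * x_i^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.random(n)
w0, w, V = 0.1, rng.random(n), rng.random((n, k))

# Check against the naive O(n^2) double sum
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
```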
Prediction
- Regression
  - Use \hat{y}(x) directly as the prediction and train with MSE, etc.
- Binary classification
  - Use the sign of \hat{y}(x) and train with logistic or hinge loss
- Ranking
  - Rank vectors x by the score \hat{y}(x) and train with a pairwise classification loss over instance pairs
Optimization Methods
- SGD, ALS, MCMC, etc.
- Can train with various losses including MSE, logistic loss, hinge loss
- Use L2 regularization to prevent overfitting
- Early stopping may give better accuracy
- The gradient of the model output with respect to each parameter is:
  - \partial \hat{y} / \partial w_0 = 1
  - \partial \hat{y} / \partial w_i = x_i
  - \partial \hat{y} / \partial v_{i,f} = x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2
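These gradients can be sketched and checked numerically (assumed shapes: x is (n,), V is (n, k); multiply by the loss derivative dL/dy for a full SGD step):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    s = V.T @ x
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))

def fm_grads(x, V):
    # d y / d w0 = 1,  d y / d w_i = x_i,
    # d y / d v_if = x_i * (sum_j v_jf x_j) - v_if * x_i^2
    s = V.T @ x                                 # (k,)
    grad_V = np.outer(x, s) - V * (x ** 2)[:, None]
    return 1.0, x, grad_V

# Numerical check of the V gradient by forward finite differences
rng = np.random.default_rng(1)
x, V = rng.random(4), rng.random((4, 2))
w0, w = 0.0, np.zeros(4)
_, _, gV = fm_grads(x, V)
eps = 1e-6
V2 = V.copy()
V2[1, 0] += eps
num = (fm_predict(x, w0, w, V2) - fm_predict(x, w0, w, V)) / eps
```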
SVD++
Original MF (SVD)
- \hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u
SVD++
- \hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \right)
- where N(u) is the set of items rated by user u, and y_j is the implicit effect of a user action on item j
- The loss function is the regularized squared error:
  - \min \sum_{(u,i)} (r_{ui} - \hat{r}_{ui})^2 + \lambda \left( b_u^2 + b_i^2 + \|q_i\|^2 + \|p_u\|^2 + \sum_{j \in N(u)} \|y_j\|^2 \right)
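A sketch of the SVD++ prediction (all values below are toy data; assumed shapes: q_i, p_u, and the rows of Y are k-vectors, N_u is the user's item set):

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, q_i, p_u, Y, N_u):
    # |N(u)|^{-1/2} * sum_{j in N(u)} y_j, the implicit-feedback term
    implicit = Y[N_u].sum(axis=0) / np.sqrt(len(N_u))
    return mu + b_u + b_i + q_i @ (p_u + implicit)

rng = np.random.default_rng(0)
k = 4
q_i, p_u = rng.random(k), rng.random(k)
Y = rng.random((10, k))   # y_j vectors for 10 items
N_u = [1, 4, 7]           # items user u has interacted with
r_hat = svdpp_predict(3.5, 0.2, -0.1, q_i, p_u, Y, N_u)
```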
Field-aware FM; FFM
- Original paper: https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
- The paper's authors won CTR-prediction competitions (the Criteo and Avazu Kaggle contests) with this model
Prediction Formula
- An extension of FM
- FM learns one embedding vector per feature
- FFM improves this by dividing by Field (original category classification) and learning Field-number of embedding vectors per feature
- FM assumed one-hot encoding of categorical variables as features, so different attribute information was uniformly converted to 1 values
- Information about which original categorical variable each feature belongs to was lost
- FFM can express field-specific interaction factors
- The interaction term becomes \phi(w, x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i, f_j}, v_{j, f_i} \rangle x_i x_j
  - where f_i and f_j denote the fields that features i and j belong to, respectively
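A sketch of the FFM score with a hypothetical field layout (features 0-2 in field 0, features 3-4 in field 1):

```python
import numpy as np

n, n_fields, k = 5, 2, 3
rng = np.random.default_rng(0)
V = rng.random((n, n_fields, k))   # V[i, f]: embedding of feature i for field f
field = [0, 0, 0, 1, 1]            # hypothetical: users in field 0, items in field 1

def ffm_score(x):
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # the pair (i, j) uses v_{i, field(j)} and v_{j, field(i)}
            s += V[i, field[j]] @ V[j, field[i]] * x[i] * x[j]
    return s

x = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # one-hot: user 0 + item 3
# only the (0, 3) pair is active, contributing <v_{0,f1}, v_{3,f0}>
```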
Training Method
- Same as FM
- However, due to the increased number of parameters, it is more prone to overfitting, so early stopping is used
Deep FM Variants
- Overview of deep FM variants (2018): https://data.gunosy.io/entry/deep-factorization-machines-2018#f-95055ec5
- Vector-wise interactions are beneficial
- Explicit architectures are needed to learn higher-order features
  - A network with different properties from standard DNNs is needed
Factorization-machine supported Neural Network (FNN)
- ECIR 2016
- A standard MLP
  - Categorical input data is one-hot encoded per variable, then embedded and fed into a dense network
  - FM weights and vectors are used to initialize layer weights
Deep Crossing Network (DCN)
- KDD2016
- Replaces the MLP part of FNN with ResNet-like skip connections to predict residuals
- Since experiments were on internal data, it is not widely cited, but later research shows higher accuracy than FNN
Wide & Deep (arXiv:1606.07792)
- 2016
- Uses embedded categorical values, same as FNN and DCN
- Attaches a linear model at the output layer
- Better accuracy than FNN and DCN, plus a TensorFlow API exists
Product-based Neural Network (PNN)
- ICDM 2017
- Simply embedding categorical variables makes it difficult to handle explicit interactions between them
- Incorporates categorical variable interaction terms into the model architecture
  - Creates a layer for interactions between categorical variables
- PNN is more accurate than FNN with stacked embeddings
  - (Possibly weaker than Wide & Deep)
DeepFM
- IJCAI 2017
- An evolution of Wide & Deep?
- Feeds sparse features into an embedding layer
  - Branch 1: feed into a Factorization Machine
  - Branch 2: feed into a dense network
- Combines Branches 1 and 2 for the output
- More accurate than FM & Deep (the version without embedding the FM input)
Deep & Cross Network (CrossNet)
- ADKDD2017
- Use continuous values as-is, embed categorical variables as features
- Branch as follows:
- Branch 1: Feed into a Cross Network that considers interaction terms
- Branch 2: Feed into a dense network
- Combine Branches 1 and 2, apply FC, then output with Sigmoid
xDeepFM
- KDD2018
- Learns higher-order features explicitly at the vector level using an architecture called Compressed Interaction Network (CIN)
- Branch:
- Linear layer
- Embedding
- Branch from 1-2 Embedding:
- CIN
- Plain DNN
- Combine 1-1, 2-1, 2-2 for output
FFM Extensions
- https://qiita.com/guglilac/items/6c8971d27c143e2567a4
- Extensions based on FFM rather than FM, so they are relatively newer
- Understanding is somewhat shallow
Field-aware Probabilistic Embedding Neural Network (FPENN)
- RecSys2018
- Prevents overfitting + introduces NN
- Same approach of learning Field-number of embeddings
- However, estimates the parameters of the probability density that the embedding follows
- This enables robust learning against noise in the training data
- Combination with Deep follows DeepFM (?)
Field-weighted Factorization Machine (FwFM)
- WWW2018
- The importance for prediction varies for each pair of fields
- Therefore, the importance of each field pair is also formalized
Interaction-aware Factorization Machines (IFM)
- AAAI 2019
- Attention + FM
- Considers attention at both feature level and field level
- Since attention is not well understood, this model and the one below are not fully grasped
Field Attention Deep Field-aware Factorization Machine (FAT-DeepFFM)
- ICDM 2019
- Attention + FFM
- Considers attention at the embedding stage
Model-based
Cluster Model
Method
- Create clusters of users with similar preference patterns
- Uses item rating values, etc.
- Find the most similar cluster for the user
- Recommend items in order of highest average rating within the cluster
Properties
- The nature of recommendations varies greatly depending on the number of clusters
- The number of clusters needs to be adjusted according to the objective
- The types of recommendations are limited by the number of clusters
- With few clusters, recommendations are broad and not very personalized
- On the other hand, more robust against the cold start problem
- With many clusters, it is difficult to find stable clusters
- On the other hand, the degree of personalization increases
- Cannot address the gray sheep problem
- Since users are assigned to specific clusters, recommendations spanning multiple clusters are not possible
- Easy to implement and fast execution speed
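The cluster-model flow can be sketched with a plain NumPy k-means over toy rating vectors (all data below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical preference patterns over 6 items (ratings 1-5)
g1 = np.clip(rng.normal([5, 5, 5, 1, 1, 1], 0.5, (10, 6)), 1, 5)
g2 = np.clip(rng.normal([1, 1, 1, 5, 5, 5], 0.5, (10, 6)), 1, 5)
R = np.vstack([g1, g2])            # 20 users x 6 items

def kmeans(X, k, iters=20):
    C = X[rng.choice(len(X), k, replace=False)]   # init centroids from data
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == c].mean(0) for c in range(k)])
    return C, labels

C, labels = kmeans(R, 2)
# A new user who likes the first three items is assigned the nearest cluster,
# then items are recommended in order of the cluster-average rating
new_user = np.array([5, 4.5, 5, 1, 1, 1.5])
c = int(np.argmin(((new_user - C) ** 2).sum(-1)))
ranking = np.argsort(-C[c])
```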
Function-based
- General models using utility functions that take larger values the more a user likes something, or functions that predict rating values
- Reduces to regression, classification, ranking regression, etc.
Regression
About the Model
- Decompose the rating matrix R (m users x n items, without missing values) as R \approx R^* = U^T V
- U is k x m dimensional, V is k x n dimensional (k is the number of latent factors)
- Column u_x of U represents the features of user x, column v_y of V represents the features of item y
- Therefore, user x's predicted rating for item y is r^*_{xy} = u_x^T v_y
- This can be viewed as a model similar to user-based collaborative filtering where u_x = 1/|v| and v_y holds the item's ratings
Training Method
- Learn U and V by decomposition to minimize the expected loss between R and R*
- Use matrix factorization techniques
- Assume the residual R-R* follows a normal distribution, and compute U and V to maximize the generation probability of observed ratings (factor analysis)
- Minimize the objective function \sum_{(x,y)} (r_{xy} - u_x^T v_y)^2 + \lambda (\|U\|^2 + \|V\|^2)
  - \lambda is a regularization parameter to prevent overfitting
- Train with SGD or ALS (fix other values and update parameters sequentially)
- For missing values, the following approaches are possible:
- Treat as latent variables and solve using the EM algorithm
- Compute only on data without missing values
- Fill with mean values, etc.
- Ignore missing values when computing the loss
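A minimal SGD sketch under these definitions (toy rating matrix; missing values are `np.nan` and are simply skipped in the loss, per the last option above):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [1, np.nan, 5, 4]], dtype=float)
m, n, k = 4, 4, 2
U, V = rng.random((k, m)), rng.random((k, n))
lr, lam = 0.02, 0.01
obs = [(x, y) for x in range(m) for y in range(n) if not np.isnan(R[x, y])]

for _ in range(2000):
    for x, y in obs:
        e = R[x, y] - U[:, x] @ V[:, y]          # prediction error
        u_old = U[:, x].copy()
        U[:, x] += lr * (e * V[:, y] - lam * U[:, x])
        V[:, y] += lr * (e * u_old - lam * V[:, y])

rmse = np.sqrt(np.mean([(R[x, y] - U[:, x] @ V[:, y]) ** 2 for x, y in obs]))
```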
Others
- Simple enough to extend with privacy-preserving filtering or nonlinear regression for better prediction accuracy
- Representing the entire system with a single linear model may be too coarse, resulting in low prediction accuracy
  - Can be addressed with the Horting method, which constructs local linear models among users with similar preferences
Classification
- Ignore the magnitude relationship of rating values and treat them as classes
- Modifications to consider magnitude relationships are also possible
- Features can include ratings from other users for the target item and ratings from the target user for other items
Sequential Binary Relation Learning
- Acquire a function representing the ease of classification for each class
- Classify the target into the class where the function value is maximum
where:
- u is the user-similarity weight for another user who has evaluated the item with a given rating
- w is the item-similarity weight for another item that the user has evaluated with a given rating
- u, w are obtained within an online learning framework
  - e.g. SGD with a non-converging learning rate (momentum), etc.
- First initialize u, w
- Each time a rating r is observed, for all users i other than user x who have rated item y:
  - Increase u if the rating is the same, decrease it otherwise
- Also, for all items other than y that user x has rated:
  - Increase the weight w if the rating is the same, decrease it otherwise
Others
- Being online learning, it can handle the addition of items and users despite being model-based
- It cannot handle the deletion of items or users
Ranking
- There is research on recommending with RankBoost
k-NN
- Uses a model but is not strictly model-based?
- Used in the context of CF
- Makes recommendations from the ratings of the K nearest items/users
  - May also take a weighted sum favoring closer users
Rule-based
- Could be called a model, but refers to expert models?
Bayesian Classifier (Probabilistic Model)
History-conditioned Type
- The conditional expected value of the target user's rating for item y, given the target user's past preference data
- The probability distribution is unknown, so an estimated function is used
- Other sample users' preference data is not referenced in the formula, but it is used in the estimation of the probability distribution, so this qualifies as collaborative filtering-based recommendation
where:
- r_{ay} is the rating given by user a to item y
- Y_a is the set of items rated by user a
- For preference data with R = 1, this becomes a conditional probability distribution
- Training data is required, but with implicit ratings, noise where r_ay=0 becomes large, which is disadvantageous
- Also, this form requires a model for each individual user, which is impractical
- In practice, the joint distribution of ratings for all items, independent of individual users, is computed
where r_{.y} is the rating distribution of all users for item y
- The conditional distribution in the first formula is computed using Bayes' rule and marginalization to eliminate unnecessary variables
- However, with |Y| = m, the number of parameters of the joint distribution, on the order of |R_1| x |R_2| x ... x |R_m|, increases exponentially with m
- Therefore, partial conditional independence between variables r.y is assumed to reduce the parameters to learn
- Probabilistic models with conditional independence are called graphical models because they represent dependencies as graphs
- Bayesian networks are one such model
Bayesian Networks
- Bayesian networks are pre-built from past user behavior data
- In production, items to recommend are determined when events occur
  - Feed viewed pages to the Bayesian network for inference, computing viewing probabilities
- Some research uses them to create user segments
  - In this case, the probability of belonging to each segment is obtained
  - Clustering can also provide distances from each cluster center
  - Bayesian networks do not require data for all nodes
- Maximum Weighted Spanning Tree (MWST) algorithm
  - Suitable for large-scale data with good prediction performance; computational complexity is O(n^2) in the number of nodes
  - Constructs a spanning tree with the highest conditional probability
  - Covered separately below
- Pearl's message passing algorithm
  - A method for quickly computing the marginal posterior probabilities of all nodes in the network
- User-item Bayesian
  - Compares metadata of items rated by the user with metadata of other items
  - Computes the probability of selecting other items
- User-user Bayesian
  - Compares metadata of items liked by users
  - Computes similarity of preferences between users
- Item-item Bayesian
  - Compares item metadata
  - Computes similarity between items
Naive Bayes
Co-occurrence Type
- The fact that a user rated an item is viewed as the co-occurrence of the event "the user is x" and the event "the item is y"
- This co-occurrence probability is modeled
- Uses Probabilistic Latent Semantic Analysis (pLSA)
- Very flexible, allowing extensions to incorporate factors such as rating variation, item information, context information, and personal attribute information
- Extending to the Bayesian framework to handle new users yields Latent Dirichlet Allocation (LDA)
- Details at http://www.kamishima.net/archive/recsysdoc.pdf P70
MWST Algorithm
- A method for constructing a tree based on mutual information between variables
- Consists of the following steps:
  1. From the given data, compute the mutual information I(x_i; x_j) for all N(N-1)/2 edges
  2. Extract the edge with the highest mutual information and make it part of the tree
  3. Add the edge with the next highest mutual information to the tree
     - However, discard the edge if it creates a loop
  4. Repeat step 3 until N-1 edges are added to the tree
  5. Finally, determine the root node and direct edges from the root toward the leaves
- In a paper recommending web pages (100,000+ pages), the root node was set as the page viewed by the most users
  - Building a Bayesian network for all site web pages was computationally infeasible, with a limit of about 5,000 pages
  - Site pages were divided into several dozen categories, Bayesian networks were built with category variables, and category inference was performed for users
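The steps above can be sketched for binary variables (a Chow-Liu-style construction on synthetic data):

```python
import numpy as np
from itertools import combinations

def mutual_info(a, b):
    # plug-in estimate of I(a; b) for 0/1 variables
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def mwst(X):
    N = X.shape[1]
    # Step 1: mutual information for all N(N-1)/2 candidate edges
    edges = sorted(((mutual_info(X[:, i], X[:, j]), i, j)
                    for i, j in combinations(range(N), 2)), reverse=True)
    parent = list(range(N))          # union-find for loop detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:            # steps 2-4: heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                 # discard edges that would create a loop
            parent[ri] = rj
            tree.append((i, j))
        if len(tree) == N - 1:
            break
    return tree

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = (x0 ^ (rng.random(500) < 0.05)).astype(int)  # strongly depends on x0
x2 = rng.integers(0, 2, 500)                      # independent of both
tree = mwst(np.column_stack([x0, x1, x2]))
# the (x0, x1) edge carries the most mutual information and enters first
```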
Advantages
- Only uses up to second-order statistics, so computation from data is easy and reliable
- Computational complexity is at most O(n^2) with respect to the number of nodes n
- Not statistically consistent, but predictive performance is good in practice
Pearl's Message Passing Algorithm
- A method for fast probabilistic inference
- Generally, in Bayesian networks, after constructing the network or when data is given to nodes, marginalization is needed to compute the posterior probability of a given node
- This algorithm efficiently computes this marginalization using the network structure:
  1. Send messages from the node that received data to its neighboring nodes
  2. Nodes that receive a message update their marginal posterior probabilities using the received message
  3. Updated nodes send messages to their neighboring nodes (except the sender)
  4. Repeat steps 2 and 3 to update all nodes' marginal posterior probabilities
At Inference Time
- Feed the pages viewed by the user as data into the Bayesian network for inference, computing item preference probabilities for that user
- Use Pearl's Message Passing algorithm for speed
- Arrange items in descending order of computed preference probability
Time Series Models
Maximum Entropy Model
- The model has the form P(y' \mid h) = \frac{1}{Z} \exp\left( \sum_k \lambda_k f_k(h, y') \right)
where:
- Z is the normalization constant
- \lambda_k are weight parameters
- f_k are feature functions
  - Feature functions return 1 when the purchase history has a certain characteristic and the next purchased item is a specific one, and 0 otherwise
- The probability of purchasing item y' next given that item y is in the purchase history is compared against the baseline probability; items where this difference is largest are selected as feature functions
- By selecting feature functions this way, the model parameters \lambda_k can be estimated based on the maximum entropy principle
- Various feature functions can be used
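A toy sketch of such a model (the items, feature functions, and weights below are all hypothetical):

```python
import numpy as np

items = ["a", "b", "c"]
# Indicator feature functions f_k(history, next_item)
features = [
    lambda h, y: 1.0 if "a" in h and y == "b" else 0.0,
    lambda h, y: 1.0 if "a" in h and y == "c" else 0.0,
]
lam = np.array([1.5, 0.2])   # hypothetical learned weights lambda_k

def next_item_probs(history):
    # P(y' | h) = exp(sum_k lambda_k f_k(h, y')) / Z
    scores = np.array([lam @ [f(history, y) for f in features] for y in items])
    e = np.exp(scores)
    return dict(zip(items, e / e.sum()))   # e.sum() plays the role of Z

probs = next_item_probs(["a"])
# "b" is most likely because its feature fires with the largest weight
```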
Markov Process Model
- Predicts using the most recent K items in the sequence
- Uses reinforcement learning, etc.
Subscription/Repeat Purchase
- Uses techniques such as survival analysis
- Cox proportional hazards model
Content-based Filtering
Data Format
- There are preference data and search queries
- For preference data, items are predicted using user attributes and context attributes as features with machine learning algorithms
References
- About recommendation systems in general: https://www.kamishima.net/archive/recsysdoc.pdf
- Classification of collaborative filtering: https://eulerdijkstra.hatenadiary.org/entry/20130407/1365349866
- Building recommendation systems with CF: https://www.slideshare.net/masayuki1986/recommendation-ml
- Japanese research using Bayesian networks for web page recommendation: https://www.ai-gakkai.or.jp/jsai2007/data/pdf/100205.pdf
- About FM, FFM, and FM+NN: https://qiita.com/s_katagiri/items/d630685dd2fe2627ecd3
- Original FM paper: https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
- DeepFM trends (2018): https://data.gunosy.io/entry/deep-factorization-machines-2018#f-95055ec5