首页 文章

如何使用增量SVD推荐系统创建建议书

提问于
浏览
1

我正在测试一个根据Simon Funk算法构建的推荐系统 . (由Timely Dev撰写.http://www.timelydevelopment.com/demos/NetflixPrize.aspx

问题是,所有增量SVD算法都试图预测user_id和movie_id的评级 . 但在实际系统中,这应该为活动用户生成一个新项目列表 . 我看到有些人在增量SVD之后使用了kNN,但是如果我不遗漏某些内容,如果我在使用增量SVD创建模型后使用kNN,则会失去所有的性能提升 .

任何人都有使用Incremental SVD / Simon Funk方法的经验,并告诉我如何制作新推荐项目列表?

4 回答

  • 0

    制作推荐电影的方法:

    • 获取尚未查看的电影列表

    • 用用户的特征向量乘以其特征向量 .

    • 按结果降序排序并拍摄热门电影 .

    对于理论:假装只有两个维度(喜剧和戏剧) . 如果我喜欢喜剧,但喜欢戏剧,我的特色向量是 [1.0, 0.0] . 如果您将我与以下电影进行比较:

    Comedy:  [1.0, 0.0] x [1.0, 0.0] = 1
    Dramedy: [0.5, 0.5] x [1.0, 0.0] = 0.5
    Drama:   [0.0, 1.0] x [1.0, 0,0] = 0
    
  • 1

    这是一个基于Yelp Netflix代码的简单Python代码 . 如果您安装Numba,它将以C速度运行 .

    data_loader.py

    import os
    import numpy as np
    from scipy import sparse
    
    class DataLoader:
        def __init__(self):
            pass
    
        @staticmethod
        def create_review_matrix(file_path):
            data = np.array([[int(tok) for tok in line.split('\t')[:3]]
                             for line in open(file_path)])
    
            ij = data[:, :2]
            ij -= 1
            values = data[:, 2]
            review_matrix = sparse.csc_matrix((values, ij.T)).astype(float)
            return review_matrix
    
    movielens_file_path = '%s/Downloads/ml-100k/u1.base' % os.environ['HOME']
    
    my_reviews = DataLoader.create_review_matrix(movielens_file_path)
    
    user_reviews = my_reviews[8]
    user_reviews = user_reviews.toarray().ravel()
    user_rated_movies,  = np.where(user_reviews > 0)
    user_ratings = user_reviews[user_rated_movies]
    
    movie_reviews = my_reviews[:, 201]
    movie_reviews = movie_reviews.toarray().ravel()
    movie_rated_users,  = np.where(movie_reviews > 0)
    movie_ratings = movie_reviews[movie_rated_users]
    
    user_pseudo_average_ratings = {}
    user_pseudo_average_ratings[8] = np.mean(user_ratings)
    user_pseudo_average_ratings[9] = np.mean(user_ratings)
    user_pseudo_average_ratings[10] = np.mean(user_ratings)
    users, movies = my_reviews.nonzero()
    
    users_matrix = np.empty((3, 3))
    users_matrix[:] = 0.1
    
    movies_matrix = np.empty((3, 3))
    movies_matrix[:] = 0.1
    
    result = users_matrix[0] * movies_matrix[0]
    otro = movies_matrix[:, 2]
    otro[2] = 8
    

    funk.py

    # Requires Movielens 100k data 
    import numpy as np, time, sys
    from data_loader import DataLoader
    from numba import jit
    import os
    
    def get_user_ratings(user_id, review_matrix):
        """
        Returns a numpy array with the ratings that user_id has made
    
        :rtype : numpy array
        :param user_id: the id of the user
        :return: a numpy array with the ratings that user_id has made
        """
        user_reviews = review_matrix[user_id]
        user_reviews = user_reviews.toarray().ravel()
        user_rated_movies, = np.where(user_reviews > 0)
        user_ratings = user_reviews[user_rated_movies]
        return user_ratings
    
    def get_movie_ratings(movie_id, review_matrix):
        """
        Returns a numpy array with the ratings that movie_id has received
    
        :rtype : numpy array
        :param movie_id: the id of the movie
        :return: a numpy array with the ratings that movie_id has received
        """
        movie_reviews = review_matrix[:, movie_id]
        movie_reviews = movie_reviews.toarray().ravel()
        movie_rated_users, = np.where(movie_reviews > 0)
        movie_ratings = movie_reviews[movie_rated_users]
        return movie_ratings
    
    def create_user_feature_matrix(review_matrix, NUM_FEATURES, FEATURE_INIT_VALUE):
        """
        Creates a user feature matrix of size NUM_FEATURES X NUM_USERS
        with all cells initialized to FEATURE_INIT_VALUE
    
        :rtype : numpy matrix
        :return: a matrix of size NUM_FEATURES X NUM_USERS
        with all cells initialized to FEATURE_INIT_VALUE
        """
        num_users = review_matrix.shape[0]
        user_feature_matrix = np.empty((NUM_FEATURES, num_users))
        user_feature_matrix[:] = FEATURE_INIT_VALUE
        return user_feature_matrix
    
    def create_movie_feature_matrix(review_matrix, NUM_FEATURES, FEATURE_INIT_VALUE):
        """
        Creates a user feature matrix of size NUM_FEATURES X NUM_MOVIES
        with all cells initialized to FEATURE_INIT_VALUE
    
        :rtype : numpy matrix
        :return: a matrix of size NUM_FEATURES X NUM_MOVIES
        with all cells initialized to FEATURE_INIT_VALUE
        """
        num_movies = review_matrix.shape[1]
        movie_feature_matrix = np.empty((NUM_FEATURES, num_movies))
        movie_feature_matrix[:] = FEATURE_INIT_VALUE
        return movie_feature_matrix
    
    @jit(nopython=True)
    def predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix):
        """
        Makes a prediction of the rating that user_id will give to movie_id if
        he/she sees it
    
        :rtype : float
        :param user_id: the id of the user
        :param movie_id: the id of the movie
        :return: a float in the range [1, 5] with the predicted rating for
        movie_id by user_id
        """
        rating = 1.
        for f in range(user_feature_matrix.shape[0]):
            rating += user_feature_matrix[f, user_id] * movie_feature_matrix[f, movie_id]
    
        # We trim the ratings in case they go above or below the stars range
        if rating > 5: rating = 5
        elif rating < 1: rating = 1
        return rating
    
    @jit(nopython=True)
    def sgd_inner(feature, A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES):
        K = 0.015
        LEARNING_RATE = 0.001
        squared_error = 0
        for k in range(len(A_data)):
            user_id = A_row[k]
            movie_id = A_col[k]
            rating = A_data[k]
            p = predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix)
            err = rating - p
    
            squared_error += err ** 2
    
            user_feature_value = user_feature_matrix[feature, user_id]
            movie_feature_value = movie_feature_matrix[feature, movie_id]
            #for j in range(NUM_FEATURES):
            user_feature_matrix[feature, user_id] += \
                LEARNING_RATE * (err * movie_feature_value - K * user_feature_value)
            movie_feature_matrix[feature, movie_id] += \
                LEARNING_RATE * (err * user_feature_value - K * movie_feature_value)
    
        return squared_error
    
    def calculate_features(A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES):
        """
        Iterates through all the ratings in search for the best features that
        minimize the error between the predictions and the real ratings.
        This is the main function in Simon Funk SVD algorithm
    
        :rtype : void
        """
        MIN_IMPROVEMENT = 0.0001
        MIN_ITERATIONS = 100
        rmse = 0
        last_rmse = 0
        print len(A_data)
        num_ratings = len(A_data)
        for feature in xrange(NUM_FEATURES):
            iter = 0
            while (iter < MIN_ITERATIONS) or  (rmse < last_rmse - MIN_IMPROVEMENT):
                last_rmse = rmse
                squared_error = sgd_inner(feature, A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES)
                rmse = (squared_error / num_ratings) ** 0.5
                iter += 1
            print ('Squared error = %f' % squared_error)
            print ('RMSE = %f' % rmse)
            print ('Feature = %d' % feature)
        return last_rmse
    
    
    LAMBDA = 0.02
    FEATURE_INIT_VALUE = 0.1
    NUM_FEATURES = 20
    
    movielens_file_path = '%s/Downloads/ml-100k/u1.base' % os.environ['HOME']
    
    A = DataLoader.create_review_matrix(movielens_file_path)
    from scipy.io import mmread, mmwrite
    mmwrite('./data/A', A)
    
    user_feature_matrix = create_user_feature_matrix(A, NUM_FEATURES, FEATURE_INIT_VALUE)
    movie_feature_matrix = create_movie_feature_matrix(A, NUM_FEATURES, FEATURE_INIT_VALUE)
    
    users, movies = A.nonzero()
    A = A.tocoo()
    
    rmse = calculate_features(A.row, A.col, A.data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES )
    print 'rmse', rmse
    
  • 0

    我认为这是一个很大的问题,因为我认为有很多推荐方法可以称为“增量SVD” . 要回答您的具体问题:kNN是在预计的项目空间上运行,而不是原始空间,因此应该非常快 .

  • 1

    假设您有n个用户和m个项目 . 增量SVD后,您有k个训练有素的功能 . 为了获得给定用户的新项目,将1xk用户特征向量和kxm项目特征矩阵相乘 . 您最终得到该用户的每个项目的m评级 . 然后只需对它们进行排序,删除它们已经看过的那些,并显示一些新的 .

相关问题