首页 文章

在numpy / scipy中为稀疏的矩阵添加一个非常重复的矩阵?

提问于
浏览
5

我正在尝试在NumPy / Scipy中实现一个函数来计算单个(训练)向量和大量其他(观察)向量之间的Jensen-Shannon divergence . 观察向量存储在非常大的(500,000x65536)Scipy sparse matrix(密集矩阵不适合存储器)中 .

作为算法的一部分,我需要为每个观察向量Oi计算T Oi,其中T是训练向量 . 我不是常见的广播规则,因为稀疏矩阵似乎不支持那些(如果T保留为密集数组,Scipy会尝试使稀疏矩阵首先密集,这会耗尽内存;如果我将T变为稀疏矩阵,T Oi失败,因为形状不一致) .

目前,我正在采取将训练向量平铺为500,000x65536稀疏矩阵的非常低效的步骤:

training = sp.csr_matrix(training.astype(np.float32))
tindptr = np.arange(0, len(training.indices)*observations.shape[0]+1, len(training.indices), dtype=np.int32)
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)

但是当它只存储~1500个“真实”元素时,它占用了大量的内存(大约6GB) . 构建起来也很慢 .

我试图通过使用stride_tricks使CSR矩阵的indptr变得聪明,数据成员不会在重复数据上使用额外的内存 .

training = sp.csr_matrix(training)
mtraining = sp.csr_matrix(observations.shape,dtype=np.int32)
tdata = training.data
vdata = np.lib.stride_tricks.as_strided(tdata, (mtraining.shape[0], tdata.size), (0, tdata.itemsize))
indices = training.indices
vindices = np.lib.stride_tricks.as_strided(indices, (mtraining.shape[0], indices.size), (0, indices.itemsize))
mtraining.indptr = np.arange(0, len(indices)*mtraining.shape[0]+1, len(indices), dtype=np.int32)
mtraining.data = vdata
mtraining.indices = vindices

但这不起作用,因为跨步视图mtraining.data和mtraining.indices是错误的形状(并且根据this answer那里看起来像一个数组(例如它没有dtype成员),并使用flatten ()方法最终制作副本 .

有没有办法完成这项工作?

1 回答

  • 3

    你甚至没有考虑过的另一个选择是自己以稀疏格式实现总和,这样你就可以充分利用数组的周期性 . 如果你滥用scipy的稀疏矩阵的这种特殊行为,这可能很容易做到:

    >>> a = sps.csr_matrix([1,2,3,4])
    >>> a.data
    array([1, 2, 3, 4])
    >>> a.indices
    array([0, 1, 2, 3])
    >>> a.indptr
    array([0, 4])
    
    >>> b = sps.csr_matrix((np.array([1, 2, 3, 4, 5]),
    ...                     np.array([0, 1, 2, 3, 0]),
    ...                     np.array([0, 5])), shape=(1, 4))
    >>> b
    <1x4 sparse matrix of type '<type 'numpy.int32'>'
        with 5 stored elements in Compressed Sparse Row format>
    >>> b.todense()
    matrix([[6, 2, 3, 4]])
    

    因此,您甚至不必在训练向量和观察矩阵的每一行之间寻找巧合来添加它们:只需用正确的指针填充所有数据,并且需要求和的东西将得到求和何时访问数据 .

    EDIT

    鉴于第一个代码的速度很慢,您可以按如下方式将内存换成速度:

    def csr_add_sparse_vec(sps_mat, sps_vec) :
        """Adds a sparse vector to every row of a sparse matrix"""
        # No checks done, but both arguments should be sparse matrices in CSR
        # format, both should have the same number of columns, and the vector
        # should be a vector and have only one row.
    
        rows, cols = sps_mat.shape
        nnz_vec = len(sps_vec.data)
        nnz_per_row = np.diff(sps_mat.indptr)
        longest_row = np.max(nnz_per_row)
    
        old_data = np.zeros((rows * longest_row,), dtype=sps_mat.data.dtype)
        old_cols = np.zeros((rows * longest_row,), dtype=sps_mat.indices.dtype)
    
        data_idx = np.arange(longest_row) < nnz_per_row[:, None]
        data_idx = data_idx.reshape(-1)
        old_data[data_idx] = sps_mat.data
        old_cols[data_idx] = sps_mat.indices
        old_data = old_data.reshape(rows, -1)
        old_cols = old_cols.reshape(rows, -1)
    
        new_data = np.zeros((rows, longest_row + nnz_vec,),
                            dtype=sps_mat.data.dtype)
        new_data[:, :longest_row] = old_data
        del old_data
        new_cols = np.zeros((rows, longest_row + nnz_vec,),
                            dtype=sps_mat.indices.dtype)
        new_cols[:, :longest_row] = old_cols
        del old_cols
        new_data[:, longest_row:] = sps_vec.data
        new_cols[:, longest_row:] = sps_vec.indices
        new_data = new_data.reshape(-1)
        new_cols = new_cols.reshape(-1)
        new_pointer = np.arange(0, (rows + 1) * (longest_row + nnz_vec),
                                longest_row + nnz_vec)
    
        ret = sps.csr_matrix((new_data, new_cols, new_pointer),
                             shape=sps_mat.shape)
        ret.eliminate_zeros()
    
        return ret
    

    它没有以前那么快,但它可以在大约1秒内完成10,000行:

    In [2]: a
    Out[2]: 
    <10000x65536 sparse matrix of type '<type 'numpy.float64'>'
        with 15000000 stored elements in Compressed Sparse Row format>
    
    In [3]: b
    Out[3]: 
    <1x65536 sparse matrix of type '<type 'numpy.float64'>'
        with 1500 stored elements in Compressed Sparse Row format>
    
    In [4]: csr_add_sparse_vec(a, b)
    Out[4]: 
    <10000x65536 sparse matrix of type '<type 'numpy.float64'>'
        with 30000000 stored elements in Compressed Sparse Row format>
    
    In [5]: %timeit csr_add_sparse_vec(a, b)
    1 loops, best of 3: 956 ms per loop
    

    EDIT 这段代码非常非常慢

    def csr_add_sparse_vec(sps_mat, sps_vec) :
        """Adds a sparse vector to every row of a sparse matrix"""
        # No checks done, but both arguments should be sparse matrices in CSR
        # format, both should have the same number of columns, and the vector
        # should be a vector and have only one row.
    
        rows, cols = sps_mat.shape
    
        new_data = sps_mat.data
        new_pointer = sps_mat.indptr.copy()
        new_cols = sps_mat.indices
    
        aux_idx = np.arange(rows + 1)
    
        for value, col in itertools.izip(sps_vec.data, sps_vec.indices) :
            new_data = np.insert(new_data, new_pointer[1:], [value] * rows)
            new_cols = np.insert(new_cols, new_pointer[1:], [col] * rows)
            new_pointer += aux_idx
    
        return sps.csr_matrix((new_data, new_cols, new_pointer),
                              shape=sps_mat.shape)
    

相关问题