Speeding up one-to-many correlation calculations in Python

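I want to compute the Pearson correlation coefficient between a vector and each row of an array in Python (assume numpy and/or scipy). Because of the size of the real data array and memory constraints, the standard correlation-matrix functions cannot be used. Here is my naive implementation: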

import numpy as np
import scipy.stats as sps

np.random.seed(0)

def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr

obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))

pr = correlateOneWithMany(X[0, :], X)

%timeit correlateOneWithMany(X[0, :], X)
# 10 loops, best of 3: 38.9 ms per loop

Any ideas for speeding this up would be greatly appreciated!

1 Answer

  • 1

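    The module scipy.spatial.distance implements the "correlation distance", which is simply one minus the correlation coefficient. You can use the function cdist to compute the one-to-many distances, and recover the correlation coefficients by subtracting the result from 1.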
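
To make the relationship concrete, here is a minimal sketch that computes the coefficient by hand (center both inputs, then take a normalized dot product) and checks it against 1 minus the cdist result. The helper name pearson_one_vs_rows and the small random test array are illustrative assumptions, not part of the original script.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pearson_one_vs_rows(one, many):
    """Pearson r of 'one' with each row of 'many' (hypothetical helper)."""
    one_c = one - one.mean()                          # center the vector
    many_c = many - many.mean(axis=1, keepdims=True)  # center each row
    num = many_c @ one_c                              # row-wise dot products
    den = np.sqrt((many_c ** 2).sum(axis=1) * (one_c ** 2).sum())
    return num / den

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 20))

r = pearson_one_vs_rows(X[0], X)
d = cdist(X[0:1], X, metric='correlation')[0]  # correlation distance

print(np.allclose(r, 1 - d))
```

Note that this sketch builds a centered copy of the whole array, so under tight memory constraints it would probably need to be applied to blocks of rows at a time.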

    Here is a modified version of your script that also computes the correlation coefficients with cdist:

    import numpy as np
    import scipy.stats as sps
    from scipy.spatial.distance import cdist
    
    np.random.seed(0)
    
    def correlateOneWithMany(one, many):
        """Return Pearson's correlation coef of 'one' with each row of 'many'."""
        pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
        pr_arr[:] = np.nan
        for row_num in np.arange(many.shape[0]):
            pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
        return pr_arr
    
    obs, varz = 10 ** 3, 500
    X = np.random.uniform(size=(obs, varz))
    
    pr = correlateOneWithMany(X[0, :], X)
    
    c = 1 - cdist(X[0:1, :], X, metric='correlation')[0]
    
    print(np.allclose(c, pr[:, 0]))
    

    Timing:

    In [133]: %timeit correlateOneWithMany(X[0, :], X)
    10 loops, best of 3: 37.7 ms per loop
    
    In [134]: %timeit 1 - cdist(X[0:1, :], X, metric='correlation')[0]
    1000 loops, best of 3: 1.11 ms per loop
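
The cdist approach returns only the coefficients, whereas the original function also stores the p-value that pearsonr returns. If the p-values are needed, one possible sketch is to recover them from r with the standard t-statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The function name correlate_one_with_many_fast is an assumption made for illustration, and the approach assumes the usual two-sided test.

```python
import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist

def correlate_one_with_many_fast(one, many):
    """Return (r, two-sided p) of 'one' with each row of 'many'.

    Hypothetical helper: r comes from cdist, p from the t distribution.
    """
    n = many.shape[1]  # number of paired values per correlation
    r = 1 - cdist(one[np.newaxis, :], many, metric='correlation')[0]
    r = np.clip(r, -1.0, 1.0)  # guard against rounding slightly past +/-1
    with np.errstate(divide='ignore'):
        # |r| == 1 gives an infinite t, which maps to p == 0
        t = r * np.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * sps.t.sf(np.abs(t), n - 2)
    return np.column_stack((r, p))

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 25))
out = correlate_one_with_many_fast(X[0], X)
print(out.shape)
```

The result has the same two-column layout as pr_arr in the original function, so it could be compared against it with np.allclose on both columns.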
    
