
Calculating Pearson correlation and significance in Python

I am looking for a function that takes two lists as input and returns the Pearson correlation together with the significance of that correlation.

16 Answers

  • 28

    You can have a look at scipy.stats:

    from pydoc import help
    from scipy.stats import pearsonr  # the old scipy.stats.stats path is deprecated
    help(pearsonr)
    
    >>>
    Help on function pearsonr in module scipy.stats.stats:
    
    pearsonr(x, y)
     Calculates a Pearson correlation coefficient and the p-value for testing
     non-correlation.
    
     The Pearson correlation coefficient measures the linear relationship
     between two datasets. Strictly speaking, Pearson's correlation requires
     that each dataset be normally distributed. Like other correlation
     coefficients, this one varies between -1 and +1 with 0 implying no
     correlation. Correlations of -1 or +1 imply an exact linear
     relationship. Positive correlations imply that as x increases, so does
     y. Negative correlations imply that as x increases, y decreases.
    
     The p-value roughly indicates the probability of an uncorrelated system
     producing datasets that have a Pearson correlation at least as extreme
     as the one computed from these datasets. The p-values are not entirely
     reliable but are probably reasonable for datasets larger than 500 or so.
    
     Parameters
     ----------
     x : 1D array
     y : 1D array the same length as x
    
     Returns
     -------
     (Pearson's correlation coefficient,
      2-tailed p-value)
    
     References
     ----------
     http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
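
    The docstring describes the p-value as the chance that an uncorrelated system produces a correlation at least as extreme as the observed one. A minimal sketch of that idea, using only the standard library, is a permutation test: shuffle one list to break any association and count how often the shuffled correlation is as extreme. The helper names `pearson_r` and `permutation_p_value`, the sample data, and the number of shuffles are illustrative assumptions; this is not how scipy computes its p-value.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (two-pass formula)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling y to break any association."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical sample data, chosen only for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
r = pearson_r(x, y)
p = permutation_p_value(x, y)
print(r, p)
```

    A permutation estimate makes no normality assumption, which can be useful for small samples where the docstring warns that the analytic p-value is less reliable.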
    
  • 0

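    The Pearson correlation can be calculated with numpy's corrcoef. It returns the full correlation matrix, so the coefficient between the two lists is the [0, 1] entry.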

    import numpy
    numpy.corrcoef(list1, list2)[0, 1]
    
  • 41

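    Another option is the native scipy function linregress, which calculates:

    slope: slope of the regression line
    intercept: intercept of the regression line
    r-value: correlation coefficient
    p-value: two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
    stderr: standard error of the estimate

    Here is an example: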

    a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
    b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
    from scipy.stats import linregress
    linregress(a, b)
    

    This returns:

    LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
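
    As a cross-check on the slope, intercept and r-value reported above, the same three numbers can be reproduced from the least-squares formulas with the standard library alone. This is a sketch for verification, not scipy's implementation, and it does not compute the p-value or the standard error; the variable names `s_ab`, `s_aa` and `s_bb` are introduced here for illustration.

```python
import math

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]

n = len(a)
mean_a = sum(a) / n
mean_b = sum(b) / n

# Sums of centred products and squares.
s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
s_aa = sum((x - mean_a) ** 2 for x in a)
s_bb = sum((y - mean_b) ** 2 for y in b)

slope = s_ab / s_aa                    # least-squares slope of b on a
intercept = mean_b - slope * mean_a    # the line passes through the two means
r = s_ab / math.sqrt(s_aa * s_bb)      # Pearson correlation coefficient

print(slope, intercept, r)
```

    The slope and r share the same numerator, so they always have the same sign: r equals the slope rescaled by sqrt(s_aa / s_bb).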
    
  • 1

    If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

    (Edited for correctness.)

    def pearsonr(x, y):
      # Assume len(x) == len(y)
      n = len(x)
      sum_x = float(sum(x))
      sum_y = float(sum(y))
      sum_x_sq = sum(xi ** 2 for xi in x)
      sum_y_sq = sum(yi ** 2 for yi in y)
      # Sum of the pairwise products (zip replaces Python 2's itertools.imap)
      psum = sum(xi * yi for xi, yi in zip(x, y))
      num = psum - (sum_x * sum_y / n)
      den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
      if den == 0: return 0
      return num / den
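
    The snippet above returns 0 when the denominator is zero, which happens when one of the lists is constant. Strictly, the correlation is undefined in that case, so a caller may prefer an explicit error. A minimal sketch of that alternative follows; the name `pearson_strict` is hypothetical, and the exact-zero check is only reliable when the input really is constant (nearly constant floating-point data can slip through).

```python
import math


def pearson_strict(x, y):
    """Pearson correlation that refuses inputs for which r is undefined."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant list has zero variance, so r would be 0/0.
        raise ValueError("correlation is undefined when an input is constant")
    return sxy / math.sqrt(sxx * syy)


print(pearson_strict([1, 2, 3], [1, 5, 7]))
```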
    
  • 13

    The following code is a straightforward interpretation of the definition:

    import math
    
    def average(x):
        assert len(x) > 0
        return float(sum(x)) / len(x)
    
    def pearson_def(x, y):
        assert len(x) == len(y)
        n = len(x)
        assert n > 0
        avg_x = average(x)
        avg_y = average(y)
        diffprod = 0
        xdiff2 = 0
        ydiff2 = 0
        for idx in range(n):
            xdiff = x[idx] - avg_x
            ydiff = y[idx] - avg_y
            diffprod += xdiff * ydiff
            xdiff2 += xdiff * xdiff
            ydiff2 += ydiff * ydiff
    
        return diffprod / math.sqrt(xdiff2 * ydiff2)
    

    Test:

    print(pearson_def([1,2,3], [1,5,7]))
    

    returns

    0.981980506062
    

    This agrees with Excel, this calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.

    R

    > cor( c(1,2,3), c(1,5,7))
    [1] 0.9819805
    

    EDIT: Fixed a bug pointed out by a commenter.

  • 4

    You can also do this with pandas.DataFrame.corr:

    import pandas as pd
    a = [[1, 2, 3],
         [5, 6, 9],
         [5, 6, 11],
         [5, 6, 13],
         [5, 3, 13]]
    df = pd.DataFrame(data=a)
    df.corr()
    

    This gives:

              0         1         2
    0  1.000000  0.745601  0.916579
    1  0.745601  1.000000  0.544248
    2  0.916579  0.544248  1.000000
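
    Each entry of that matrix is the Pearson correlation between two columns of the data. A minimal pure-Python sketch that rebuilds the same matrix column by column is shown here; the helper `pearson_r` and the variable names are illustrative assumptions, not part of pandas.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rows = [[1, 2, 3],
        [5, 6, 9],
        [5, 6, 11],
        [5, 6, 13],
        [5, 3, 13]]

# Transpose the rows so that each tuple holds one column of the table.
columns = list(zip(*rows))

# Entry [i][j] is the correlation between column i and column j.
matrix = [[pearson_r(ci, cj) for cj in columns] for ci in columns]

for row in matrix:
    print("  ".join(f"{value:.6f}" for value in row))
```

    The matrix is symmetric with ones on the diagonal, so only the entries above the diagonal carry information.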
    
  • 4

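    Rather than relying on numpy / scipy, I think my answer should be the easiest to code, and the easiest way to understand the steps in calculating the Pearson correlation coefficient (PCC).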

    import math
    
    # calculates the mean
    def mean(x):
        sum = 0.0
        for i in x:
             sum += i
        return sum / len(x) 
    
    # calculates the sample standard deviation
    def sampleStandardDeviation(x):
        sumv = 0.0
        for i in x:
             sumv += (i - mean(x))**2
        return math.sqrt(sumv/(len(x)-1))
    
    # calculates the PCC using both the 2 functions above
    def pearson(x,y):
        scorex = []
        scorey = []
    
        for i in x: 
            scorex.append((i - mean(x))/sampleStandardDeviation(x)) 
    
        for j in y:
            scorey.append((j - mean(y))/sampleStandardDeviation(y))
    
    # multiplies both lists together into 1 list (hence zip) and sums the whole list   
        return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)
    

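    The significance of the PCC is basically to show you how strongly correlated the two variables/lists are. It is important to note that the PCC value ranges from -1 to 1. A value between 0 and 1 denotes a positive correlation. A value of 0 means the highest variation (no linear correlation whatsoever). A value between -1 and 0 denotes a negative correlation.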
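
    To make that range concrete, here is a small sketch with made-up data: one perfectly increasing line, one perfectly decreasing line, and one symmetric curve. The helper `pearson_r` and the data are assumptions for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * v + 1 for v in x])       # exact increasing line
r_down = pearson_r(x, [-v for v in x])            # exact decreasing line
r_curve = pearson_r(x, [(v - 3) ** 2 for v in x])  # symmetric parabola

print(r_up, r_down, r_curve)
```

    The parabola case gives a coefficient of zero even though y is completely determined by x, which shows that a PCC of 0 rules out a linear relationship only, not every kind of dependence.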

  • 0

    Hmm, many of these responses have long, difficult-to-read code...

    When working with arrays, I'd suggest using numpy and its nifty features:

    import numpy as np
    def pcc(X, Y):
       ''' Compute Pearson Correlation Coefficient. '''
       # Centre X and Y (on copies, so the caller's arrays are not modified)
       X = X - X.mean(0)
       Y = Y - Y.mean(0)
       # Standardise X and Y
       X = X / X.std(0)
       Y = Y / Y.std(0)
       # Compute mean product
       return np.mean(X*Y)
    
    # Using it on a random example
    from random import random
    X = np.array([random() for x in range(100)])
    Y = np.array([random() for x in range(100)])
    pcc(X, Y)
    
  • 4

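    Here's an implementation of Pearson correlation based on sparse vectors. The vectors here are expressed as lists of (index, value) tuples. The two sparse vectors can have different numbers of stored entries, but the overall vector size must be the same for both. This is useful for text mining applications where the vector size is extremely large because most features are bag-of-words, so calculations are usually performed using sparse vectors.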

    from math import sqrt  # sqrt is used below but was not imported in the original snippet
    
    def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
        indexed_feature_dict = {}
        if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
            raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")
    
        sum_a = sum(value for index, value in first_feature_vector)
        sum_b = sum(value for index, value in second_feature_vector)
    
        avg_a = float(sum_a) / length_of_featureset
        avg_b = float(sum_b) / length_of_featureset
    
        mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
            length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
        mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
            length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))
    
        covariance_a_b = 0
    
        #calculate covariance for the sparse vectors
        for tuple in first_feature_vector:
            if len(tuple) != 2:
                raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
            indexed_feature_dict[tuple[0]] = tuple[1]
        count_of_features = 0
        for tuple in second_feature_vector:
            count_of_features += 1
            if len(tuple) != 2:
                raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
            if tuple[0] in indexed_feature_dict:
                covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
                del (indexed_feature_dict[tuple[0]])
            else:
                covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)
    
        for index in indexed_feature_dict:
            count_of_features += 1
            covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)
    
        #adjust covariance with rest of vector with 0 value
        covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b
    
        if mean_sq_error_a == 0 or mean_sq_error_b == 0:
            return -1
        else:
            return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
    

    Unit tests:

    def test_get_get_pearson_corelation(self):
        # assertAlmostEqual replaces the deprecated assertAlmostEquals alias
        vector_a = [(1, 1), (2, 2), (3, 3)]
        vector_b = [(1, 1), (2, 5), (3, 7)]
        self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)
    
        vector_a = [(1, 1), (2, 2), (3, 3)]
        vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
        self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
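
    The same sparse idea can be sketched more compactly as a standalone function if the vectors are stored as dictionaries mapping index to value, with every missing index treated as zero. The dictionary representation and the name `sparse_pearson` are assumptions made for this sketch and differ from the tuple lists used above; the sum-of-products formula it relies on can lose precision on nearly constant data.

```python
import math


def sparse_pearson(a, b, n):
    """Pearson correlation of two sparse vectors of total length n.

    a and b map index -> stored value; indices that are absent count as 0.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    mean_a = sum(a.values()) / n
    mean_b = sum(b.values()) / n
    # Only indices stored in both vectors contribute to the product sum.
    sum_ab = sum(value * b[index] for index, value in a.items() if index in b)
    sum_aa = sum(value * value for value in a.values())
    sum_bb = sum(value * value for value in b.values())
    # Centred sums via sum(xy) - n * mean_x * mean_y; zeros are implicit.
    cov = sum_ab - n * mean_a * mean_b
    var_a = sum_aa - n * mean_a ** 2
    var_b = sum_bb - n * mean_b ** 2
    if var_a <= 0 or var_b <= 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


r_dense = sparse_pearson({1: 1, 2: 2, 3: 3}, {1: 1, 2: 5, 3: 7}, 3)
r_sparse = sparse_pearson({1: 1, 2: 2, 3: 3}, {1: 1, 2: 5, 3: 7, 4: 14}, 5)
print(r_dense, r_sparse)
```

    The work is proportional to the number of stored entries rather than to n, which is the point of using a sparse representation.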
    
  • 33

    Here's an implementation of the Pearson correlation function using numpy:

    def corr(data1, data2):
        "data1 & data2 should be numpy arrays."
        mean1 = data1.mean() 
        mean2 = data2.mean()
        std1 = data1.std()
        std2 = data2.std()
    
    #     corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
        corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
        return corr
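
    The function above divides by numpy's default standard deviation, which is the population form (divide by n), while an earlier answer uses the sample form (divide by n - 1). As long as the covariance and both standard deviations use the same divisor, the factor cancels and the coefficient is identical. A minimal standard-library sketch of that point, with hypothetical function names:

```python
from statistics import mean, pstdev, stdev


def pearson_population(x, y):
    """Population covariance divided by population standard deviations."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def pearson_sample(x, y):
    """Sample covariance divided by sample standard deviations."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


x = [1, 2, 3]
y = [1, 5, 7]
r_pop = pearson_population(x, y)
r_samp = pearson_sample(x, y)
print(r_pop, r_samp)
```

    Mixing the two conventions, for example a population covariance with sample standard deviations, is what produces a wrong coefficient.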
    
  • 87

    Here's a variant of mkh's answer that, by using numba, runs much faster than it and than scipy.stats.pearsonr.

    import numba
    
    @numba.jit
    def corr(data1, data2):
        M = data1.size
    
        sum1 = 0.
        sum2 = 0.
        for i in range(M):
            sum1 += data1[i]
            sum2 += data2[i]
        mean1 = sum1 / M
        mean2 = sum2 / M
    
        var_sum1 = 0.
        var_sum2 = 0.
        cross_sum = 0.
        for i in range(M):
            var_sum1 += (data1[i] - mean1) ** 2
            var_sum2 += (data2[i] - mean2) ** 2
            cross_sum += (data1[i] * data2[i])
    
        std1 = (var_sum1 / M) ** .5
        std2 = (var_sum2 / M) ** .5
        cross_mean = cross_sum / M
    
        return (cross_mean - mean1 * mean2) / (std1 * std2)
    
  • 171

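    You may wonder how to interpret your probability when you are looking for correlation in a particular direction (negative or positive). Here is a function I wrote to help with that. It might even be right!

    It's based on information I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.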

    from scipy import stats  # needed for stats.pearsonr below
    
    # Given (possibly random) variables, X and Y, and a correlation direction,
    # returns:
    #  (r, p),
    # where r is the Pearson correlation coefficient, and p is the probability
    # that there is no correlation in the given direction.
    #
    # direction:
    #  if positive, p is the probability that there is no positive correlation in
    #    the population sampled by X and Y
    #  if negative, p is the probability that there is no negative correlation
    #  if 0, p is the probability that there is no correlation in either direction
    def probabilityNotCorrelated(X, Y, direction=0):
        x = len(X)
        if x != len(Y):
            raise ValueError("variables not same len: " + str(x) + ", and " + \
                             str(len(Y)))
        if x < 6:
            raise ValueError("must have at least 6 samples, but have " + str(x))
        (corr, prb_2_tail) = stats.pearsonr(X, Y)
    
        if not direction:
            return (corr, prb_2_tail)
    
        prb_1_tail = prb_2_tail / 2
        if corr * direction > 0:
            return (corr, prb_1_tail)
    
        return (corr, 1 - prb_1_tail)
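
    The conversion at the end of that function can be isolated so it is easy to check by hand. This sketch takes an already computed coefficient and two-tailed p-value, so it needs no scipy; the name `one_tailed_p` and the example numbers are made up for illustration, and the halving step assumes the null distribution is symmetric, as it is for the t-based test.

```python
def one_tailed_p(r, p_two_tailed, direction):
    """Convert a two-tailed p-value into a one-tailed one.

    direction > 0 tests for positive correlation, direction < 0 for negative,
    and direction == 0 leaves the two-tailed value unchanged.
    """
    if direction == 0:
        return p_two_tailed
    half = p_two_tailed / 2
    # If r points the way we are testing, the tail is the small half;
    # otherwise the observed value sits in the opposite tail.
    return half if r * direction > 0 else 1 - half


# Hypothetical inputs: r = 0.5 with a two-tailed p-value of 0.1.
print(one_tailed_p(0.5, 0.1, 1))
print(one_tailed_p(0.5, 0.1, -1))
print(one_tailed_p(0.5, 0.1, 0))
```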
    
  • 16

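    You can take a look at this article. It is a well-documented example of calculating correlations from historical forex currency-pair data spread over multiple files using the pandas library (for Python), and then generating a heatmap plot with the seaborn library.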

    http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

  • 7

    def pearson(x,y):
      n=len(x)
      vals=range(n)
    
      sumx=sum([float(x[i]) for i in vals])
      sumy=sum([float(y[i]) for i in vals])
    
      sumxSq=sum([x[i]**2.0 for i in vals])
      sumySq=sum([y[i]**2.0 for i in vals])
    
      pSum=sum([x[i]*y[i] for i in vals])
      # Calculating Pearson correlation
      num=pSum-(sumx*sumy/n)
      den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
      if den==0: return 0
      r=num/den
      return r
    
  • 0

    I have a very simple and easy-to-understand solution. For two arrays of equal length, the Pearson coefficient can be computed easily as follows:

    import numpy as np
    
    def manual_pearson(a, b):
      """
      Accepts two arrays of equal length, and computes correlation coefficient.
      Numerator is the sum of product of (a - a_avg) and (b - b_avg),
      while denominator is the product of a_std and b_std multiplied by
      length of array.
      """
      # Convert to float arrays so that plain Python lists work too
      a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
      a_avg, b_avg = np.average(a), np.average(b)
      a_stdev, b_stdev = np.std(a), np.std(b)
      n = len(a)
      denominator = a_stdev * b_stdev * n
      numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
      p_coef = numerator/denominator
      return p_coef
    
  • 1

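    Pearson coefficient calculation using pandas in Python: I would suggest trying this approach since your data consists of lists. It is easy to interact with your data and manipulate it from the console, since you can visualise the data structure and update it as you wish. You can also export the dataset, save it, and add new data from the Python console for later analysis. This code is simpler and contains fewer lines. I am assuming you need a few quick lines of code to screen your data for further analysis.

    Example: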

    data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}
    
    import pandas as pd # To convert your lists into a pandas DataFrame
    
    df = pd.DataFrame(data, columns = ['list 1','list 2'])
    
    from scipy import stats # For in-built method to get PCC
    
    pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
    print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results
    

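    However, you did not post your data, so I could not see the size of the dataset or any transformations that might be needed before analysis.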
