from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
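As a quick sanity check of the docstring above, here is a minimal call to `pearsonr` (the sample values are my own illustration, not from the original answer):

```python
from scipy.stats import pearsonr

# two short illustrative samples with an exact linear relationship
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r, p = pearsonr(x, y)
print(r)  # a perfect positive linear relationship gives r == 1.0
```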
import math

# calculates the mean
def mean(x):
    total = 0.0
    for i in x:
        total += i
    return total / len(x)

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    avg = mean(x)
    sumv = 0.0
    for i in x:
        sumv += (i - avg) ** 2
    return math.sqrt(sumv / (len(x) - 1))

# calculates the PCC using both of the functions above
def pearson(x, y):
    # hoist the means and standard deviations out of the loops so they
    # are computed once per list, not once per element
    mx, my = mean(x), mean(y)
    sx, sy = sampleStandardDeviation(x), sampleStandardDeviation(y)
    scorex = [(i - mx) / sx for i in x]
    scorey = [(j - my) / sy for j in y]
    # multiplies both lists together element-wise (hence zip) and sums the result
    return sum(i * j for i, j in zip(scorex, scorey)) / (len(x) - 1)
What the PCC basically tells you is how strongly the two variables/lists are correlated. It is worth noting that the PCC ranges from -1 to 1. A value between 0 and 1 indicates a positive correlation, a value of 0 means maximum variation (no correlation at all), and a value between -1 and 0 indicates a negative correlation.
Hmm, many of these responses have long, hard-to-read code...
I'd suggest using numpy with its nifty features when working with arrays:
import numpy as np

def pcc(X, Y):
    ''' Compute Pearson Correlation Coefficient. '''
    # Work on float copies so the caller's arrays are not modified in place
    X = np.asarray(X, dtype=float).copy()
    Y = np.asarray(Y, dtype=float).copy()
    # Normalise X and Y
    X -= X.mean(0)
    Y -= Y.mean(0)
    # Standardise X and Y
    X /= X.std(0)
    Y /= Y.std(0)
    # Compute mean product
    return np.mean(X * Y)

# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)
# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#   (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#   if positive, p is the probability that there is no positive correlation in
#     the population sampled by X and Y
#   if negative, p is the probability that there is no negative correlation
#   if 0, p is the probability that there is no correlation in either direction
from scipy import stats

def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " +
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)
    if not direction:
        return (corr, prb_2_tail)
    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)
    return (corr, 1 - prb_1_tail)
def pearson(x, y):
    n = len(x)
    vals = range(n)
    sumx = sum([float(x[i]) for i in vals])
    sumy = sum([float(y[i]) for i in vals])
    sumxSq = sum([x[i] ** 2.0 for i in vals])
    sumySq = sum([y[i] ** 2.0 for i in vals])
    pSum = sum([x[i] * y[i] for i in vals])
    # Calculating Pearson correlation
    num = pSum - (sumx * sumy / n)
    den = ((sumxSq - pow(sumx, 2) / n) * (sumySq - pow(sumy, 2) / n)) ** .5
    if den == 0:
        return 0
    r = num / den
    return r
import numpy as np

def manual_pearson(a, b):
    """
    Accepts two arrays of equal length, and computes correlation coefficient.
    Numerator is the sum of product of (a - a_avg) and (b - b_avg),
    while denominator is the product of a_std and b_std multiplied by
    length of array.
    """
    a_avg, b_avg = np.average(a), np.average(b)
    a_stdev, b_stdev = np.std(a), np.std(b)
    n = len(a)
    denominator = a_stdev * b_stdev * n
    numerator = np.sum(np.multiply(a - a_avg, b - b_avg))
    p_coef = numerator / denominator
    return p_coef
import pandas as pd      # to convert your lists into pandas DataFrames
from scipy import stats  # for the built-in method to get the PCC

data = {'list 1': [2, 4, 6, 8], 'list 2': [4, 16, 36, 64]}
df = pd.DataFrame(data, columns=['list 1', 'list 2'])

# define the columns to perform the calculation on
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"])
print("Pearson Correlation Coefficient:", pearson_coef, "and a P-value of:", p_value)
16 Answers
You can have a look at scipy.stats:
The Pearson correlation can be calculated with numpy's corrcoef.
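A minimal sketch of the corrcoef route (the sample data is my own illustration). `np.corrcoef` returns the full 2x2 correlation matrix, so the coefficient itself is the off-diagonal entry:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.9])

# corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y)
r = np.corrcoef(x, y)[0, 1]
print(r)
```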
Another option can be the native scipy function linregress, which calculates the slope, intercept, r-value, p-value, and standard error of the regression.
Here is an example:
which will return you:
If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:
(Edited for correctness.)
The following code is a straightforward interpretation of the definition:
Test:
returns
This agrees with Excel, this calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.
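A sketch of such a direct-definition implementation, consistent with the figures quoted above (the sample lists are an assumption on my part, chosen so the result matches 0.98198...):

```python
import math

def average(x):
    return sum(x) / len(x)

def pearson_def(x, y):
    # direct translation of the definition:
    # r = cov(x, y) / (stdev(x) * stdev(y))
    assert len(x) == len(y)
    n = len(x)
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0.0
    xdiff2 = 0.0
    ydiff2 = 0.0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff
    return diffprod / math.sqrt(xdiff2 * ydiff2)

print(pearson_def([1, 2, 3], [1, 5, 7]))  # ~0.9819805060619657
```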
R:
EDIT: Fixed a bug pointed out by a commenter.
You can also do this with pandas.DataFrame.corr:
This gives
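A minimal sketch of the `DataFrame.corr` route (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8]})

# corr() computes pairwise Pearson correlation of the columns by default
corr_matrix = df.corr()
print(corr_matrix)
r = corr_matrix.loc['A', 'B']  # 1.0 for this perfectly linear pair
```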
I think my answer should be the easiest to code, and the easiest to follow for the steps of calculating the Pearson correlation coefficient (PCC), without relying on numpy/scipy.
Here is an implementation of Pearson correlation based on sparse vectors. The vectors here are represented as lists of (index, value) tuples. The two sparse vectors can have different lengths, but the total vector size must be the same for both. This is useful for text-mining applications where the vector size is extremely large, since most features are bags of words, so calculations are usually performed with sparse vectors.
Unit tests:
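The sparse-vector code itself did not survive the formatting here, so below is my own sketch of the idea described above, assuming (index, value) tuples and a shared total vector size `n`; it is not the original answer's code:

```python
import math

def sparse_pearson(x, y, n):
    """Pearson correlation of two sparse vectors.

    x, y: lists of (index, value) tuples; missing indices are zero.
    n: the total (dense) vector size, shared by both vectors.
    """
    sum_x = sum(v for _, v in x)
    sum_y = sum(v for _, v in y)
    sum_x2 = sum(v * v for _, v in x)
    sum_y2 = sum(v * v for _, v in y)
    # the dot product only needs indices present in both vectors
    ymap = dict(y)
    dot = sum(v * ymap[i] for i, v in x if i in ymap)
    num = dot - sum_x * sum_y / n
    den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    if den == 0:
        return 0.0
    return num / den

# unit test: sparse encodings of the dense vectors [1, 2, 3] and [1, 5, 7]
x = [(0, 1.0), (1, 2.0), (2, 3.0)]
y = [(0, 1.0), (1, 5.0), (2, 7.0)]
print(sparse_pearson(x, y, 3))  # matches the dense result, ~0.98198
```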
Here is an implementation of the Pearson correlation function using numpy:
This is a variation of mkh's answer that runs much faster than it, and than scipy.stats.pearsonr, using numba.
You may be wondering how to interpret your probability in the context of looking for a correlation in a particular direction (a negative or positive correlation). Here is a function I wrote to help with that. It might even be right!
It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.
You can have a look at this article. It is a well-documented example of calculating correlations based on historical forex currency-pair data from multiple files using the pandas library (for Python), and then generating a heatmap plot using the seaborn library.
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
I have a very simple and easy-to-understand solution. For two arrays of equal length, the Pearson coefficient can easily be computed as follows:
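A minimal numpy sketch of that computation (the function and variable names are mine): the coefficient is the sum of the products of the centered arrays over the product of their root sums of squares.

```python
import numpy as np

def simple_pearson(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    # covariance term over the product of the (unnormalised) standard deviations
    return (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())

print(simple_pearson([1, 2, 3], [1, 5, 7]))  # ~0.98198
```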
Pearson coefficient calculation using pandas in Python: I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console, since you can visualise your data structure and update it as you wish. You can also export the data set, save it, and add new data later from the Python console for further analysis. This code is simpler and contains fewer lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis.
Example:
However, you did not post your data for me, so I cannot see the size of the data set or the transformations that might be needed before analysis.