Fitting empirical distribution to theoretical ones with Scipy (Python)?

Introduction: I have a list of more than 30,000 values ranging from 0 to 47, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47, etc.], sampled from some continuous distribution.

Problem: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, as you can see, the p-value for 0 approaches 1 and the p-value for higher numbers tends to 0.

I don't know if I am right, but to determine the probabilities I think I need to fit my data to the theoretical distribution that is most suitable to describe my data. I assume that some kind of goodness-of-fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!

6 Answers

  • 107

    Distribution Fitting with Sum of Squared Error (SSE)

    This is an update and modification to Saullo's answer, which uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

    Example Fit

    Using the El Niño dataset from statsmodels, the distributions are fit and the error is determined. The distribution with the least error is returned.

    All Distributions

    [Figure: All Fitted Distributions]

    Best Fit Distribution

    [Figure: Best Fit Distribution]

    Example Code

    %matplotlib inline
    
    import warnings
    import numpy as np
    import pandas as pd
    import scipy.stats as st
    import statsmodels.api as sm
    import matplotlib
    import matplotlib.pyplot as plt
    
    matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
    matplotlib.style.use('ggplot')
    
    # Create models from data
    def best_fit_distribution(data, bins=200, ax=None):
        """Model data by finding best fit distribution to data"""
        # Get histogram of original data
        y, x = np.histogram(data, bins=bins, density=True)
        x = (x + np.roll(x, -1))[:-1] / 2.0
    
        # Distributions to check (frechet_r and frechet_l were removed in newer
        # SciPy releases and are omitted here)
        DISTRIBUTIONS = [        
            st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
            st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
            st.foldcauchy,st.foldnorm,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
            st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
            st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
            st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
            st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
            st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
            st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
            st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
        ]
    
        # Best holders
        best_distribution = st.norm
        best_params = (0.0, 1.0)
        best_sse = np.inf
    
        # Estimate distribution parameters from data
        for distribution in DISTRIBUTIONS:
    
            # Try to fit the distribution
            try:
                # Ignore warnings from data that can't be fit
                with warnings.catch_warnings():
                    warnings.filterwarnings('ignore')
    
                    # fit dist to data
                    params = distribution.fit(data)
    
                    # Separate parts of parameters
                    arg = params[:-2]
                    loc = params[-2]
                    scale = params[-1]
    
                    # Calculate fitted PDF and error with fit in distribution
                    pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                    sse = np.sum(np.power(y - pdf, 2.0))
    
                    # if an axis was passed in, add this fitted PDF to the plot
                    try:
                        if ax:
                            pd.Series(pdf, x).plot(ax=ax)
                    except Exception:
                        pass
    
                    # identify if this distribution is better
                    if best_sse > sse > 0:
                        best_distribution = distribution
                        best_params = params
                        best_sse = sse
    
            except Exception:
                pass
    
        return (best_distribution.name, best_params)
    
    def make_pdf(dist, params, size=10000):
        """Generate distributions's Probability Distribution Function """
    
        # Separate parts of parameters
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
    
        # Get sane start and end points of distribution
        start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
        end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    
        # Build PDF and turn into pandas Series
        x = np.linspace(start, end, size)
        y = dist.pdf(x, loc=loc, scale=scale, *arg)
        pdf = pd.Series(y, x)
    
        return pdf
    
    # Load data from statsmodels datasets
    data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
    
    # Plot for comparison
    plt.figure(figsize=(12,8))
    ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
    # Save plot limits
    dataYLim = ax.get_ylim()
    
    # Find best fit distribution
    best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
    best_dist = getattr(st, best_fit_name)
    
    # Update plots
    ax.set_ylim(dataYLim)
    ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
    ax.set_xlabel(u'Temp (°C)')
    ax.set_ylabel('Frequency')
    
    # Make PDF with best params 
    pdf = make_pdf(best_dist, best_fit_params)
    
    # Display
    plt.figure(figsize=(12,8))
    ax = pdf.plot(lw=2, label='PDF', legend=True)
    data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)
    
    param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
    param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
    dist_str = '{}({})'.format(best_fit_name, param_str)
    
    ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
    ax.set_xlabel(u'Temp. (°C)')
    ax.set_ylabel('Frequency')
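
    Once best_fit_name and best_fit_params are returned, the p-value asked for in the question (the probability of seeing a greater value) can be read off the fitted distribution's survival function. A minimal sketch building on the variables above; the value 24.0 is only an illustrative temperature, and the parameter split follows the same convention as make_pdf:

    # Tail probability P(X > value) from the best-fitting distribution
    arg = best_fit_params[:-2]
    loc = best_fit_params[-2]
    scale = best_fit_params[-1]
    p_value = best_dist.sf(24.0, *arg, loc=loc, scale=scale)
    print('P(X > 24.0) = {:0.4f}'.format(p_value))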
    
  • 9

    There are 82 implemented distribution functions in SciPy 0.12.0. You can test how some of them fit your data using their fit() method. Check the code below for more details:

    import matplotlib.pyplot as plt
    import numpy as np
    import scipy.stats

    size = 30000
    x = np.arange(size)
    y = np.round(scipy.stats.vonmises.rvs(5, size=size) * 47).astype(int)
    h = plt.hist(y, bins=range(48), color='w')
    
    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
    
    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        param = dist.fit(y)
        pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
        plt.plot(pdf_fitted, label=dist_name)
        plt.xlim(0,47)
    plt.legend(loc='upper right')
    plt.show()
    

    References:

    - Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

    - Distribution fitting with Scipy

    Here is a list with the names of all the distribution functions available in SciPy 0.12.0:

    dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy']
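
    To compare candidates beyond visual inspection, a goodness-of-fit test can be run on each fitted distribution. Below is a minimal, hedged sketch using scipy.stats.kstest; the normal distribution and the synthetic sample are only illustrative, and note that the KS p-value is optimistic when the parameters were estimated from the same data being tested:

    import numpy as np
    import scipy.stats

    # Hypothetical sample; substitute your own 30,000 values
    data = scipy.stats.norm.rvs(loc=20.0, scale=8.0, size=30000)

    params = scipy.stats.norm.fit(data)                   # MLE fit: (loc, scale)
    D, p = scipy.stats.kstest(data, 'norm', args=params)  # KS goodness of fit
    print('KS statistic = %.4f, p-value = %.4f' % (D, p))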
    
  • 3

    The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data is the one that scores best under your chosen criterion, which can be determined in several different ways, for example:

    1. the one that gives you the highest log likelihood;

    2. the one that gives you the smallest AIC, BIC or BICc value (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; this can basically be viewed as the log likelihood adjusted for the number of parameters, since a distribution with more parameters is expected to fit better);

    3. the one that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).

    Of course, if you already have a distribution that should describe your data (based on the theory of your particular field) and want to stick to that, you will skip the step of identifying the best-fitting distribution.

    scipy does not come with a function to calculate log likelihood (although the MLE method is provided), but hard-coding one is easy; see Is the build-in probability density functions of scipy.stat.distributions slower than a user provided one?
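
    For instance, the log likelihood of a fitted scipy distribution can be accumulated from its logpdf, and the AIC follows directly. A minimal sketch; the gamma distribution and the synthetic sample are only illustrative:

    import numpy as np
    import scipy.stats as st

    # Hypothetical sample; substitute your own data
    data = st.gamma.rvs(2.0, size=1000)

    dist = st.gamma
    params = dist.fit(data)                      # MLE estimates: (shape(s), loc, scale)
    loglik = np.sum(dist.logpdf(data, *params))  # log likelihood at the MLE
    k = len(params)                              # number of fitted parameters
    aic = 2 * k - 2 * loglik
    print('log-likelihood = %.2f, AIC = %.2f' % (loglik, aic))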

  • 129

    AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of the different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

    In []: import numpy as np
    In []: values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    In []: counts = np.asarray(np.bincount(values), dtype=float)
    In []: cdf = counts.cumsum() / counts.sum()
    

    Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):

    In []: 1 - cdf[1]
    Out[]: 0.40000000000000002
    

    Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas the sf is defined only for continuous distributions.
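
    If you would rather have the same empirical frequencies wrapped in a scipy object, scipy.stats.rv_discrete accepts the support and the normalized counts directly; a minimal sketch reproducing the result above:

    import numpy as np
    import scipy.stats as st

    values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    counts = np.bincount(values)
    support = np.arange(len(counts))
    probs = counts / counts.sum()

    # Wrap the empirical frequencies as a discrete distribution
    emp = st.rv_discrete(name='empirical', values=(support, probs))
    print(emp.sf(1))  # P(X > 1) = 0.4, matching 1 - cdf[1] above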

  • 0

    This sounds like a probability density estimation problem to me.

    from scipy.stats import gaussian_kde
    occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
    values = range(0, 48)
    kde = gaussian_kde(list(map(float, occurences)))
    p = kde(values)
    p = p / sum(p)
    print("P(x>=1) = %f" % sum(p[1:]))
    

    See also http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.
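
    gaussian_kde also exposes integrate_box_1d, so the tail probability can be obtained by integrating the continuous estimate instead of summing normalized grid values. A minimal sketch; the uniform synthetic sample is only a stand-in for your data:

    import numpy as np
    from scipy.stats import gaussian_kde

    # Hypothetical sample; substitute your own occurrences
    sample = np.random.randint(0, 48, size=30000).astype(float)
    kde = gaussian_kde(sample)

    # P(X >= 1) as the integral of the estimated density over [1, inf)
    p_tail = 1.0 - kde.integrate_box_1d(-np.inf, 1.0)
    print("P(x>=1) = %f" % p_tail)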

  • 2

    Forgive me if I misunderstand your need, but what about storing your data in a dictionary where the keys are the numbers between 0 and 47 and the values are the number of occurrences of each key in your original list?
    Thus your likelihood p(x) will be the sum of all the values for keys greater than x, divided by 30,000. A minimal sketch of that idea follows below.
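
    The sketch uses collections.Counter; the short list is only a stand-in for the 30,000-value list:

    from collections import Counter

    # Stand-in data; substitute your 30,000-value list
    data = [0, 0, 0, 1, 1, 2, 3, 47]
    counts = Counter(data)
    n = len(data)

    def p_greater(x):
        """Empirical P(X > x): occurrences of keys above x, divided by n."""
        return sum(c for k, c in counts.items() if k > x) / n

    print(p_greater(1))  # 0.375 for this stand-in list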
