Fitting empirical distribution to theoretical ones with Scipy (Python)?

Introduction: I have a list of more than 30,000 values ranging from 0 to 47, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47, etc.], sampled from some continuous distribution.

Problem: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, as you can see, the p-value for 0 approaches 1 and the p-value for higher numbers tends to 0.

I don't know if I am right, but to determine the probabilities I think I need to fit my data to the theoretical distribution that is most suitable to describe my data. I assume that some kind of goodness-of-fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!

6 Answers

  • 107

    Distribution Fitting with Sum of Squared Error (SSE)

    This is an update and modification to Saullo's answer, which uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution's histogram and the data's histogram.

    Example Fit

    Using the El Niño dataset from statsmodels, the distributions are fit and the error is determined. The distribution with the least error is returned.

    All Distributions

    [Figure: All Fitted Distributions]

    Best Fit Distribution

    [Figure: Best Fit Distribution]

    Example Code

    %matplotlib inline
    
    import warnings
    import numpy as np
    import pandas as pd
    import scipy.stats as st
    import statsmodels.api as sm
    import matplotlib
    import matplotlib.pyplot as plt
    
    matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
    matplotlib.style.use('ggplot')
    
    # Create models from data
    def best_fit_distribution(data, bins=200, ax=None):
        """Model data by finding best fit distribution to data"""
        # Get histogram of original data
        y, x = np.histogram(data, bins=bins, density=True)
        x = (x + np.roll(x, -1))[:-1] / 2.0
    
        # Distributions to check (frechet_r and frechet_l were removed in newer
        # SciPy releases and are omitted here)
        DISTRIBUTIONS = [        
            st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
            st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
            st.foldcauchy,st.foldnorm,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
            st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
            st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
            st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
            st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
            st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
            st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
            st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
        ]
    
        # Best holders
        best_distribution = st.norm
        best_params = (0.0, 1.0)
        best_sse = np.inf
    
        # Estimate distribution parameters from data
        for distribution in DISTRIBUTIONS:
    
            # Try to fit the distribution
            try:
                # Ignore warnings from data that can't be fit
                with warnings.catch_warnings():
                    warnings.filterwarnings('ignore')
    
                    # fit dist to data
                    params = distribution.fit(data)
    
                    # Separate parts of parameters
                    arg = params[:-2]
                    loc = params[-2]
                    scale = params[-1]
    
                    # Calculate fitted PDF and error with fit in distribution
                    pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                    sse = np.sum(np.power(y - pdf, 2.0))
    
                    # if an axis was passed in, add this fitted PDF to the plot
                    try:
                        if ax:
                            pd.Series(pdf, x).plot(ax=ax)
                    except Exception:
                        pass
    
                    # identify if this distribution is better
                    if best_sse > sse > 0:
                        best_distribution = distribution
                        best_params = params
                        best_sse = sse
    
            except Exception:
                pass
    
        return (best_distribution.name, best_params)
    
    def make_pdf(dist, params, size=10000):
        """Generate distributions's Probability Distribution Function """
    
        # Separate parts of parameters
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
    
        # Get sane start and end points of distribution
        start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
        end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    
        # Build PDF and turn into pandas Series
        x = np.linspace(start, end, size)
        y = dist.pdf(x, loc=loc, scale=scale, *arg)
        pdf = pd.Series(y, x)
    
        return pdf
    
    # Load data from statsmodels datasets
    data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
    
    # Plot for comparison
    plt.figure(figsize=(12,8))
    ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
    # Save plot limits
    dataYLim = ax.get_ylim()
    
    # Find best fit distribution
    best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
    best_dist = getattr(st, best_fit_name)
    
    # Update plots
    ax.set_ylim(dataYLim)
    ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
    ax.set_xlabel(u'Temp (°C)')
    ax.set_ylabel('Frequency')
    
    # Make PDF with best params 
    pdf = make_pdf(best_dist, best_fit_params)
    
    # Display
    plt.figure(figsize=(12,8))
    ax = pdf.plot(lw=2, label='PDF', legend=True)
    data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)
    
    param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
    param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
    dist_str = '{}({})'.format(best_fit_name, param_str)
    
    ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
    ax.set_xlabel(u'Temp. (°C)')
    ax.set_ylabel('Frequency')
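
    Once best_fit_name and best_fit_params are returned, the p-value asked for in the question (the probability of seeing a greater value) can be read off the fitted distribution's survival function. A minimal sketch building on the variables above; the value 24.0 is only an illustrative temperature, and the parameter split follows the same convention as make_pdf:

    # Tail probability P(X > value) from the best-fitting distribution
    arg = best_fit_params[:-2]
    loc = best_fit_params[-2]
    scale = best_fit_params[-1]
    p_value = best_dist.sf(24.0, *arg, loc=loc, scale=scale)
    print('P(X > 24.0) = {:0.4f}'.format(p_value))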
    
  • 9

    There are 82 implemented distribution functions in SciPy 0.12.0. You can test how some of them fit your data using their fit() method. Check the code below for more details:

    import matplotlib.pyplot as plt
    import numpy as np
    import scipy.stats

    size = 30000
    x = np.arange(size)
    y = np.round(scipy.stats.vonmises.rvs(5, size=size) * 47).astype(int)
    h = plt.hist(y, bins=range(48), color='w')
    
    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
    
    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        param = dist.fit(y)
        pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
        plt.plot(pdf_fitted, label=dist_name)
        plt.xlim(0,47)
    plt.legend(loc='upper right')
    plt.show()
    

    References:

    - Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

    - Distribution fitting with Scipy

    Here is a list with the names of all the distribution functions available in SciPy 0.12.0:

    dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy']
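
    To compare candidates beyond visual inspection, a goodness-of-fit test can be run on each fitted distribution. Below is a minimal, hedged sketch using scipy.stats.kstest; the normal distribution and the synthetic sample are only illustrative, and note that the KS p-value is optimistic when the parameters were estimated from the same data being tested:

    import numpy as np
    import scipy.stats

    # Hypothetical sample; substitute your own 30,000 values
    data = scipy.stats.norm.rvs(loc=20.0, scale=8.0, size=30000)

    params = scipy.stats.norm.fit(data)                   # MLE fit: (loc, scale)
    D, p = scipy.stats.kstest(data, 'norm', args=params)  # KS goodness of fit
    print('KS statistic = %.4f, p-value = %.4f' % (D, p))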
    
  • 3

    The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data is the one that scores best under your chosen criterion, which can be determined in several different ways, for example:

    1. the one that gives you the highest log likelihood;

    2. the one that gives you the smallest AIC, BIC or BICc value (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; this can basically be viewed as the log likelihood adjusted for the number of parameters, since a distribution with more parameters is expected to fit better);

    3. the one that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).

    Of course, if you already have a distribution that should describe your data (based on the theory of your particular field) and want to stick to that, you will skip the step of identifying the best-fitting distribution.

    scipy does not come with a function to calculate log likelihood (although the MLE method is provided), but hard-coding one is easy; see Is the build-in probability density functions of scipy.stat.distributions slower than a user provided one?
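
    For instance, the log likelihood of a fitted scipy distribution can be accumulated from its logpdf, and the AIC follows directly. A minimal sketch; the gamma distribution and the synthetic sample are only illustrative:

    import numpy as np
    import scipy.stats as st

    # Hypothetical sample; substitute your own data
    data = st.gamma.rvs(2.0, size=1000)

    dist = st.gamma
    params = dist.fit(data)                      # MLE estimates: (shape(s), loc, scale)
    loglik = np.sum(dist.logpdf(data, *params))  # log likelihood at the MLE
    k = len(params)                              # number of fitted parameters
    aic = 2 * k - 2 * loglik
    print('log-likelihood = %.2f, AIC = %.2f' % (loglik, aic))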

  • 129

    AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of the different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

    In []: import numpy as np
    In []: values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    In []: counts = np.asarray(np.bincount(values), dtype=float)
    In []: cdf = counts.cumsum() / counts.sum()
    

    Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):

    In []: 1 - cdf[1]
    Out[]: 0.40000000000000002
    

    Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas the sf is defined only for continuous distributions.
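
    If you would rather have the same empirical frequencies wrapped in a scipy object, scipy.stats.rv_discrete accepts the support and the normalized counts directly; a minimal sketch reproducing the result above:

    import numpy as np
    import scipy.stats as st

    values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    counts = np.bincount(values)
    support = np.arange(len(counts))
    probs = counts / counts.sum()

    # Wrap the empirical frequencies as a discrete distribution
    emp = st.rv_discrete(name='empirical', values=(support, probs))
    print(emp.sf(1))  # P(X > 1) = 0.4, matching 1 - cdf[1] above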

  • 0

    This sounds like a probability density estimation problem to me.

    from scipy.stats import gaussian_kde
    occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
    values = range(0, 48)
    kde = gaussian_kde(list(map(float, occurences)))
    p = kde(values)
    p = p / sum(p)
    print("P(x>=1) = %f" % sum(p[1:]))
    

    See also http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.
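
    gaussian_kde also exposes integrate_box_1d, so the tail probability can be obtained by integrating the continuous estimate instead of summing normalized grid values. A minimal sketch; the uniform synthetic sample is only a stand-in for your data:

    import numpy as np
    from scipy.stats import gaussian_kde

    # Hypothetical sample; substitute your own occurrences
    sample = np.random.randint(0, 48, size=30000).astype(float)
    kde = gaussian_kde(sample)

    # P(X >= 1) as the integral of the estimated density over [1, inf)
    p_tail = 1.0 - kde.integrate_box_1d(-np.inf, 1.0)
    print("P(x>=1) = %f" % p_tail)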

  • 2

    Forgive me if I misunderstand your need, but what about storing your data in a dictionary where the keys are the numbers between 0 and 47 and the values are the number of occurrences of each key in your original list?
    Thus your likelihood p(x) will be the sum of all the values for keys greater than x, divided by 30,000. A minimal sketch of that idea follows below.
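
    The sketch uses collections.Counter; the short list is only a stand-in for the 30,000-value list:

    from collections import Counter

    # Stand-in data; substitute your 30,000-value list
    data = [0, 0, 0, 1, 1, 2, 3, 47]
    counts = Counter(data)
    n = len(data)

    def p_greater(x):
        """Empirical P(X > x): occurrences of keys above x, divided by n."""
        return sum(c for k, c in counts.items() if k > x) / n

    print(p_greater(1))  # 0.375 for this stand-in list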
