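I am trying to fit distributions to my data. I split the data into four different lengths (99.9%, 99.0%, 95.0% and 90.0% of the length). The fits are computed with SciPy's fit methods. My problem is that the shorter the data range, the worse the fit gets. I would expect the opposite: with a shorter data interval, the fit should be easier. I compare the fits using R² and SSE.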
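To make the setup concrete, here is a minimal sketch of how such subsets could be built. It assumes the four "lengths" are upper percentile cut-offs of the data; the array `data` and the helper `truncate_at_percentile` are hypothetical stand-ins, not the actual measurements or code.

```python
import numpy as np

# Hypothetical stand-in for the real measurements (not the actual data).
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=10.0, size=10_000)

def truncate_at_percentile(values, percent):
    """Keep only the values at or below the given upper percentile."""
    threshold = np.percentile(values, percent)
    return values[values <= threshold], threshold

# One (subset, threshold) pair per data length used in the question.
subsets = {p: truncate_at_percentile(data, p) for p in (99.9, 99.0, 95.0, 90.0)}

for p, (subset, threshold) in subsets.items():
    print(f"{p}%: {subset.size} values, threshold {threshold:.2f}")
```

Each subset then plays the role of `h4` in the code below, with the threshold corresponding to `out_threshold4`.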

Fit for 99.9% of the data:

Fitting

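import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev  # assumed: gev refers to SciPy's GEV distribution

# Normalised histogram of the data; density=True replaces the deprecated normed=1.
n4, bins4, patches4 = plt.hist(h4, bins=binwidth, density=True, facecolor='#023d6b', alpha=0.5, histtype='bar')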

lnspc =np.arange(0,int(out_threshold4)-0.5,0.5)

m,s = stats.norm.fit(h4)
pdf_g=stats.norm.pdf(lnspc,m,s)
#plt.plot(lnspc,pdf_g, label="Norm")

ag,bg,cg = stats.gamma.fit(h4)  
pdf_gamma = stats.gamma.pdf(lnspc, ag, bg,cg)  
plt.plot(lnspc, pdf_gamma, label="Gamma")

ab,bb,cb,db = stats.beta.fit(h4)  
pdf_beta = stats.beta.pdf(lnspc, ab, bb,cb, db)  
#plt.plot(lnspc, pdf_beta, label="Beta")

gevfit = gev.fit(h4)  
pdf_gev = gev.pdf(lnspc, *gevfit)   
plt.plot(lnspc, pdf_gev, label="GEV")

logfit = stats.lognorm.fit(h4)  
pdf_lognorm = stats.lognorm.pdf(lnspc, *logfit)  
plt.plot(lnspc, pdf_lognorm, label="LogNormal")

weibfit = stats.weibull_min.fit(h4,loc=0.1)  
pdf_weib = stats.weibull_min.pdf(lnspc, *weibfit)  
plt.plot(lnspc, pdf_weib, label="Weibull")

burr12fit = stats.burr12.fit(h4,loc=0.1)  
pdf_burr12 = stats.burr12.pdf(lnspc, *burr12fit)  
plt.plot(lnspc, pdf_burr12, label="Burr")

genparetofit = stats.genpareto.fit(h4)
pdf_genpareto = stats.genpareto.pdf(lnspc, *genparetofit)
plt.plot(lnspc, pdf_genpareto, label ="Gen-Pareto")

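# Fit an 8-component Gaussian mixture. GaussianMixture replaces the legacy GMM class,
# which is no longer available in current scikit-learn releases.
from sklearn.mixture import GaussianMixture

myarray = np.array(h4).reshape(-1, 1)  # scikit-learn expects a 2-D array

clf = GaussianMixture(n_components=8, max_iter=500, random_state=3)
clf = clf.fit(myarray)
# score_samples returns the log-density at each point, so exp() gives the pdf.
pdf_gmm = np.exp(clf.score_samples(lnspc.reshape(-1, 1)))
plt.plot(lnspc, pdf_gmm, label = "GMM")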
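For comparing a fitted pdf with the histogram, both arrays need one value per bin. A minimal sketch of one way to guarantee that is to evaluate every fitted pdf at the bin centres. The sample, the bin count and the reduced set of candidate distributions here are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for h4.
rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=10.0, size=5_000)

# Normalised histogram and the centre of each bin.
counts, edges = np.histogram(sample, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# A few of the candidate distributions, fitted by maximum likelihood.
candidates = {"Norm": stats.norm, "Gamma": stats.gamma, "LogNormal": stats.lognorm}

fitted_pdfs = {}
for name, dist in candidates.items():
    params = dist.fit(sample)                        # shape(s), loc, scale
    fitted_pdfs[name] = dist.pdf(centers, *params)   # one value per bin
```

With this layout, `counts` and every entry of `fitted_pdfs` have the same length, so they can be compared element by element.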

Calculation of R²

df4 = pd.DataFrame({'Strecke': bins4[:-1]+1, 'Propability': n4})

slope, intercept, r_value_norm4, p_value, std_err = stats.linregress(df4['Propability'],pdf_g)
#print ("R-squared Normal Distribution:", r_value_norm**2)

slope, intercept, r_value_gamma4, p_value, std_err = stats.linregress(df4['Propability'],pdf_gamma)
#print ("R-squared Gamma Distribution:", r_value_gamma**2)

slope, intercept, r_value_beta4, p_value, std_err = stats.linregress(df4['Propability'],pdf_beta)
#print ("R-squared Beta Distribution:", r_value_beta**2)

slope, intercept, r_value_gev4, p_value, std_err = stats.linregress(df4['Propability'],pdf_gev)
#print ("R-squared GEV Distribution:", r_value_gev**2)

slope, intercept, r_value_lognorm4, p_value, std_err = stats.linregress(df4['Propability'],pdf_lognorm)
#print ("R-squared LogNormal Distribution:", r_value_lognorm**2)

slope, intercept, r_value_weibull4, p_value, std_err = stats.linregress(df4['Propability'],pdf_weib)
#print ("R-squared Weibull Distribution:", r_value_weibull**2)

slope, intercept, r_value_burr12ull4, p_value, std_err = stats.linregress(df4['Propability'],pdf_burr12)

slope, intercept, r_value_genpareto4, p_value, std_err = stats.linregress(df4['Propability'],pdf_genpareto)

slope, intercept, r_value_gmm4, p_value, std_err = stats.linregress(df4['Propability'].values,pdf_gmm)
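For comparison, the squared correlation from `linregress` is not the only way to define R². A minimal sketch of the coefficient of determination computed directly from the residuals is shown next to it. The sample, bin count and the helper `r_squared` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for h4.
rng = np.random.default_rng(2)
sample = rng.normal(loc=50.0, scale=5.0, size=5_000)

# Observed bin heights and the fitted pdf at the bin centres.
observed, edges = np.histogram(sample, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = stats.norm.pdf(centers, *stats.norm.fit(sample))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_direct = r_squared(observed, predicted)
# Squared Pearson correlation, as obtained via linregress in the code above.
r2_linregress = stats.linregress(observed, predicted).rvalue ** 2

print(f"direct R²: {r2_direct:.4f}, squared correlation: {r2_linregress:.4f}")
```

The squared correlation does not change if the fitted pdf is scaled or shifted by a constant, so it can stay high even when the curve sits at the wrong height. The residual-based form penalises that and is never larger than the squared correlation.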

Calculation of SSE

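# Initialise the residual sums of squares (assumed to start at zero).
s4rss_norm = s4rss_gamma = s4rss_beta = s4rss_gev = s4rss_lognorm = 0.0
s4rss_weib = s4rss_burr12 = s4rss_genpareto = s4rss_gmm = 0.0

# Note: because of the j+1 indexing, the first bin (index 0) is not included.
for j in range(0,len(df4['Propability'])-1):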

 s4rss_norm += (df4['Propability'].iloc[j+1] - pdf_g[j+1])**2
 s4rss_gamma += (df4['Propability'].iloc[j+1] - pdf_gamma[j+1])**2
 s4rss_beta += (df4['Propability'].iloc[j+1] - pdf_beta[j+1])**2
 s4rss_gev += (df4['Propability'].iloc[j+1] - pdf_gev[j+1])**2
 s4rss_lognorm += (df4['Propability'].iloc[j+1] - pdf_lognorm[j+1])**2
 s4rss_weib += (df4['Propability'].iloc[j+1] - pdf_weib[j+1])**2
 s4rss_burr12 += (df4['Propability'].iloc[j+1] - pdf_burr12[j+1])**2
 s4rss_genpareto += (df4['Propability'].iloc[j+1] - pdf_genpareto[j+1])**2
 s4rss_gmm += (df4['Propability'].iloc[j+1] - pdf_gmm[j+1])**2
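The same sum can be written without an explicit loop. A minimal sketch with NumPy follows; the helper names `sse` and `mse` and the toy numbers are hypothetical and not taken from the real data.

```python
import numpy as np

def sse(observed, predicted):
    """Sum of squared errors between bin heights and fitted pdf values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((observed - predicted) ** 2))

def mse(observed, predicted):
    """SSE divided by the number of bins."""
    return sse(observed, predicted) / np.asarray(observed).size

# Toy bin heights and pdf values (hypothetical numbers).
observed = np.array([0.2, 0.4, 0.3, 0.1])
predicted = np.array([0.1, 0.4, 0.3, 0.1])

sse_all = sse(observed, predicted)            # uses every bin
sse_skip = sse(observed[1:], predicted[1:])   # same bins as the j+1 loop above
mse_all = mse(observed, predicted)
```

Here the only mismatch is in the first bin, so `sse_all` is about 0.01 while the variant that starts at index 1 gives 0. The per-bin mean is included because a plain sum depends on how many bins are compared, which differs between the four data ranges.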

The calculations for the other ranges, e.g. 90.0%, are done in the same way with the corresponding data.

The figures below show the fits and the resulting R² and SSE values.

Distribution fitting

R² and SSE results

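My supervisor and I expected the fit for 90% of the data to be better, because the outliers are cut off. Shouldn't the SSE values for the smaller data ranges also be smaller, since there are fewer data points to check? Are there other options for fitting these distributions? What am I missing?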