
Simple prediction using Pearson correlation and linear regression in Python


I have a dataset like this:

    Value   Month       Year
    103.4   April       2006
    270.6   August      2006
    51.9    December    2006
    156.9   February    2006
    126.9   January     2006
    96.8    July        2006
    183.1   June        2006
    266.6   March       2006
    193.1   May         2006
    524.7   November    2006
    619.9   October     2006
    129     September   2006
    374.1   April       2007
    260.5   August      2007
    119.6   December    2007
    9.9     February    2007
    91.1    January     2007
    106.6   July        2007
    79.9    June        2007
    60.5    March       2007
    432.4   May         2007
    128.8   November    2007
    292.1   October     2007
    129.3   September   2007

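Value is the rainfall of a district, which we'll call districtA. I have data from 2006 to 2014, and I need to predict the rainfall of districtA for the next 2 years. I chose Pearson correlation and linear regression from the sklearn library to make the prediction. I'm confused because I don't know how to set up X and Y. I'm new to Python, so any help is appreciated. Thank you.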

P.S. I found code like this:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error (the average, not the sum, of squared residuals)
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

When I print diabetes_X_train, it gives me this:

[[ 0.07786339]
 [-0.03961813]
 [ 0.01103904]
 [-0.04069594]
 [-0.03422907]...]

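I assume these are the r values obtained from the correlation and the coefficients. When I print diabetes_y_train, it gives me something like this: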

[ 233.   91.  111.  152.  120.  .....]

My question is: how do I get the r values from the rainfall data and assign them to the x-axis?
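For context, the numbers printed from diabetes_X_train are the values of one input feature, not correlation coefficients. Pearson's r is a single number that summarises how linearly two columns move together. A minimal sketch, assuming the 2006 rows above reordered by calendar month and a hypothetical month index (0 = January) as the x variable:

```python
import numpy as np

# Rainfall for 2006 from the table above, reordered January..December.
rain_2006 = np.array([126.9, 156.9, 266.6, 103.4, 193.1, 183.1,
                      96.8, 270.6, 129.0, 619.9, 524.7, 51.9])

# Assumed x variable: month index 0..11 (an illustrative choice, not the
# only possible one).
month_idx = np.arange(12)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(month_idx, rain_2006)[0, 1]
print("Pearson r between month index and rainfall: %.3f" % r)

# For sklearn, X must be 2-D (n_samples, n_features) and y must be 1-D.
X = month_idx.reshape(-1, 1)
y = rain_2006
print(X.shape, y.shape)
```

So r is not something that goes on the x-axis: the x-axis holds the input feature (here the month index), the y-axis holds the rainfall, and r only measures how strongly the two are linearly related.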

1 Answer

  • 0

    Not the best solution, but it works.

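    A brief explanation: I replaced each month name with its index in a list, because the algorithm needs numeric input. I also replaced the whitespace delimiters with ';', because different rows contain different numbers of spaces, which is inconvenient to parse. Your data now looks like this: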

    Value;Month;Year
    103.4;April;2006
    270.6;August;2006
    51.9;December;2006
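As an alternative to rewriting the delimiters by hand, pandas can split on runs of whitespace directly. A small sketch, assuming the rows are held in an in-memory string (`raw` is a stand-in for the real file):

```python
import io
import pandas as pd

# A few rows in the original whitespace-separated layout.
raw = """Value   Month       Year
103.4   April       2006
270.6   August      2006
51.9    December    2006
"""

# sep=r'\s+' treats any run of spaces or tabs as a single delimiter.
data = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(data)
```

With a file on disk, the same `sep` argument can be passed along with the file path instead of the `io.StringIO` object.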
    

    The initial data file is 'data.csv'.

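    import pandas as pd
    import sklearn.linear_model as ll
    
    data = pd.read_csv('data.csv', sep=';')
    # Strip stray whitespace from the header names (e.g. 'Year ')
    data.columns = data.columns.str.strip()
    
    month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    
    # Replace each month name with its index in the list (January -> 0).
    # The .ix indexer used previously has been removed from pandas.
    data['Month'] = data['Month'].str.strip().map(month.index)
    
    # Features: month index and year; target: rainfall
    X = data[['Month', 'Year']]
    y = data['Value']
    lr = ll.LinearRegression()
    lr.fit(X, y)
    
    ######### TEST DATA ##########
    # Month indices are 0-based, so 1 is February and 2 is March
    X_test = [[1, 2008], [2, 2008]]
    X_test = pd.DataFrame(X_test, columns=['Month', 'Year'])
    
    y_test = lr.predict(X_test)
    print(y_test)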
    

    After testing, I got these values:

    [69.23079837  80.63691725]
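The same approach can be extended to the two-year forecast the question asks about. The sketch below is only an illustration under stated assumptions: it trains on the 2006 and 2007 rows shown in the question (not the full 2006 to 2014 data), uses a 0-based month index, and the names `rain`, `X_future` and `y_future` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rainfall from the question, in calendar order (January..December).
rain = {
    2006: [126.9, 156.9, 266.6, 103.4, 193.1, 183.1,
           96.8, 270.6, 129.0, 619.9, 524.7, 51.9],
    2007: [91.1, 9.9, 60.5, 374.1, 432.4, 79.9,
           106.6, 260.5, 129.3, 292.1, 128.8, 119.6],
}

# One training row per month: features are (month index, year).
X_train = np.array([[m, year] for year in rain for m in range(12)])
y_train = np.array([v for year in rain for v in rain[year]])

lr = LinearRegression().fit(X_train, y_train)

# Every (month, year) pair for the two years after the last training year.
last_year = max(rain)
X_future = np.array([[m, year]
                     for year in (last_year + 1, last_year + 2)
                     for m in range(12)])
y_future = lr.predict(X_future)

for (m, year), value in zip(X_future, y_future):
    print("%d-%02d: %.1f" % (year, m + 1, value))
```

Because the model is linear in the month index and the year, it cannot represent a seasonal peak, so the output is best read as a rough trend rather than a month-by-month forecast.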
    
