
Multiple linear regression in Python


I can't seem to find any Python libraries that do multiple regression. The only things I can find do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(Output of the above:)

y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python, to get the linear regression formula:

Y = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5*x5 + a6*x6 + a7*x7 + c

10 Answers

  • 11

    sklearn.linear_model.LinearRegression will do it:

    from sklearn import linear_model
    clf = linear_model.LinearRegression()
    # one row of [x1, ..., x7] per observation, paired with the list of y values
    clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
            [t.y for t in texts])
    

    Then clf.coef_ will have the regression coefficients.

    sklearn.linear_model also has similar interfaces that apply various kinds of regularization to the regression, as sketched below.
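
    For instance, a minimal sketch of one regularized variant; X and y here are hypothetical stand-ins for the same rows and targets passed to clf.fit above:

    from sklearn import linear_model

    # hypothetical X (rows of [x1..x7]) and y, shaped as in the fit call above
    ridge = linear_model.Ridge(alpha=1.0).fit(X, y)
    print(ridge.coef_, ridge.intercept_)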

  • 54

    Here is a little workaround that I created. I checked it against R, and it works correctly.

    import numpy as np
    import statsmodels.api as sm
    
    y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
    
    x = [
         [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
         [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
         [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
         ]
    
    def reg_m(y, x):
        # build the design matrix one predictor at a time;
        # the ones column is the intercept
        ones = np.ones(len(x[0]))
        X = sm.add_constant(np.column_stack((x[0], ones)))
        for ele in x[1:]:
            # add_constant is a no-op here, since the ones column already exists
            X = sm.add_constant(np.column_stack((ele, X)))
        results = sm.OLS(y, X).fit()
        return results
    

    Results:

    print(reg_m(y, x).summary())
    

    Output:

    OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.535
    Model:                            OLS   Adj. R-squared:                  0.461
    Method:                 Least Squares   F-statistic:                     7.281
    Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
    Time:                        21:51:28   Log-Likelihood:                -26.025
    No. Observations:                  23   AIC:                             60.05
    Df Residuals:                      19   BIC:                             64.59
    Df Model:                           3                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    x1             0.2424      0.139      1.739      0.098        -0.049     0.534
    x2             0.2360      0.149      1.587      0.129        -0.075     0.547
    x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
    const          1.5704      0.633      2.481      0.023         0.245     2.895
    
    ==============================================================================
    Omnibus:                        6.904   Durbin-Watson:                   1.905
    Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
    Skew:                          -0.849   Prob(JB):                       0.0950
    Kurtosis:                       4.426   Cond. No.                         38.6
    

    pandas provides a convenient way to run OLS, as given in the answer linked below:

    Run an OLS regression with Pandas Data Frame
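
    For what it's worth, the loop in reg_m stacks the predictors in reverse order and appends a column of ones, so an equivalent one-line construction (a sketch, not part of the original answer) is:

    # same y and x lists as above; columns come out as x[2], x[1], x[0], const
    X = sm.add_constant(np.column_stack(list(reversed(x))), prepend=False)
    print(sm.OLS(y, X).fit().params)  # same estimates as reg_m(y, x)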

  • 37

    Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. Reference: Difference

    The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized, as it causes much confusion and misunderstanding in the literature.

    In short:

    • multiple linear regression: the response y is a scalar.

    • multivariate linear regression: the response y is a vector.

    (Another source.)
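
    Given that distinction, it may be worth noting that sklearn's LinearRegression also accepts a two-dimensional y, which corresponds to the multivariate (vector-valued response) case. A minimal sketch with made-up data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))           # 3 predictors
    Y = X @ rng.normal(size=(3, 2)) + 1.0  # 2 response columns -> y is a vector

    model = LinearRegression().fit(X, Y)
    print(model.coef_.shape)  # (2, 3): one row of coefficients per response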

  • 3

    I think this may be the easiest way to get the job done:

    from random import random
    from pandas import DataFrame
    from statsmodels.api import OLS
    lr = lambda : [random() for i in range(100)]
    x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
    x['b'] = 1
    y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
    
    print(x.head())
    
             x1        x2        x3  b
    0  0.433681  0.946723  0.103422  1
    1  0.400423  0.527179  0.131674  1
    2  0.992441  0.900678  0.360140  1
    3  0.413757  0.099319  0.825181  1
    4  0.796491  0.862593  0.193554  1
    
    print(y.head())
    
    0    6.637392
    1    5.849802
    2    7.874218
    3    7.087938
    4    7.102337
    dtype: float64
    
    model = OLS(y, x)
    result = model.fit()
    print(result.summary())
    
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       1.000
    Model:                            OLS   Adj. R-squared:                  1.000
    Method:                 Least Squares   F-statistic:                 5.859e+30
    Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
    Time:                        15:17:32   Log-Likelihood:                 3224.9
    No. Observations:                 100   AIC:                            -6442.
    Df Residuals:                      96   BIC:                            -6431.
    Df Model:                           3                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
    x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
    x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
    b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
    ==============================================================================
    Omnibus:                        7.675   Durbin-Watson:                   1.614
    Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
    Skew:                           0.045   Prob(JB):                        0.210
    Kurtosis:                       2.140   Cond. No.                         6.89
    ==============================================================================
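
    Rather than adding the constant column by hand (x['b'] = 1), statsmodels can build the same design matrix with sm.add_constant; a small sketch under the same setup:

    import statsmodels.api as sm

    X = sm.add_constant(x[['x1', 'x2', 'x3']])  # prepends a 'const' column of ones
    print(sm.OLS(y, X).fit().params)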
    
  • 7

    You can use numpy.linalg.lstsq:

    import numpy as np

    y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
    X = np.array([
        [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
        [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
        [-0.76, -0.71,  -0.98,  0.75, -0.86, -0.50, -0.33, -0.94,  -1.03],
        [14.73, 13.74,  15.49, 24.72, 16.59, 22.44, 13.93, 11.40,  18.18],
        [ 4.02,  4.47,   4.18,  4.96,  4.29,  4.81,  4.32,  4.43,   4.28],
        [ 0.20,  0.16,   0.19,  0.16,  0.10,  0.15,  0.21,  0.16,   0.21],
        [ 0.45,  0.50,   0.53,  0.60,  0.48,  0.53,  0.50,  0.49,   0.55],
    ])
    X = X.T  # transpose so input vectors are along the rows
    X = np.c_[X, np.ones(X.shape[0])]  # add bias term
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(beta_hat)
    

    Results:

    [ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]
    

    You can see the estimated output (the fitted values) with:

    print(np.dot(X, beta_hat))
    

    Results:

    [ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
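
    As a side note, np.linalg.lstsq also returns the residual sum of squares, so a goodness-of-fit number can be computed without refitting. A minimal sketch reusing X and y from above (the residuals array is only non-empty when X has full column rank):

    beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    ss_res = residuals[0]                 # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()  # total sum of squares
    print(1 - ss_res / ss_tot)            # R-squared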
    
  • 3

    You can use the function below and pass it a DataFrame:

    def linear(x, y=None, show=True):
        """
        @param x: pd.DataFrame
        @param y: pd.DataFrame or pd.Series or None
                  if None, use the last column of x as y
        @param show: whether to print the regression summary
        """
        import pandas as pd
        import statsmodels.api as sm

        # stack y as the last column (or use x's last column), add the intercept
        xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
        res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

        if show: print(res.summary())
        return res
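
    A hypothetical usage sketch (random data; the last column is taken as the response by default):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(100, 4), columns=['x1', 'x2', 'x3', 'y'])
    res = linear(df)  # y is None, so the last column 'y' is used as the response
    print(res.params)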
    
  • 83

    Use scipy.optimize.curve_fit. And it works not only for linear fits.

    from scipy.optimize import curve_fit
    import numpy as np

    def fn(x, a, b, c):
        return a + b*x[0] + c*x[1]

    # y(x0,x1) data:
    #    x0=0 1 2
    # ___________
    # x1=0 |0 1 2
    # x1=1 |1 2 3
    # x1=2 |2 3 4

    x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]])
    y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])
    popt, pcov = curve_fit(fn, x, y)
    print(popt)
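
    To illustrate the non-linear part of that claim, here is a sketch (made-up model and data, not from the original answer) that fits an exponential term with the same API:

    def fn2(x, a, b, c):
        return a * np.exp(b * x[0]) + c * x[1]

    x2 = np.random.rand(2, 50)
    y2 = 1.5 * np.exp(0.5 * x2[0]) + 2.0 * x2[1]
    popt2, pcov2 = curve_fit(fn2, x2, y2)
    print(popt2)  # should recover approximately [1.5, 0.5, 2.0]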
    
  • 3

    Once you have converted your data to a pandas dataframe (df):

    import statsmodels.formula.api as smf
    lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
    print(lm.params)
    

    The intercept term is included by default.

    See this notebook for more examples.
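
    If the intercept is not wanted, the patsy formula syntax used by statsmodels can drop it with - 1; a small sketch against the same df:

    lm_no_intercept = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 - 1',
                              data=df).fit()
    print(lm_no_intercept.params)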

  • 23

    Multiple linear regression can be handled using the sklearn library, as referenced above. I'm using the Anaconda install of Python 3.6.

    Create your model as follows:

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X, y)
    
    # display coefficients
    print(regressor.coef_)
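
    As a short follow-up sketch (assuming X and y are the arrays passed to fit above), the intercept and predictions are available in the same way:

    print(regressor.intercept_)      # fitted constant term
    print(regressor.predict(X[:3]))  # predictions for the first three rows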
    
  • 1

    You can use numpy.linalg.lstsq.
