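We've had some ML models running on top of the Azure ML Studio platform (the original drag-and-drop system). They've been fine for over a year, but we need to move on so that we can scale better. So I'm rewriting them in a Jupyter notebook using scikit-learn.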

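The good news/bad news is that the data we train on is fairly small (a few hundred records in a database). It's very imperfect data making very imperfect regression predictions, so error is to be expected. And that's fine. What I don't understand is what I'm doing wrong, but I'm clearly doing something wrong.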

The obvious things to suspect (in my view) are that either I'm training on my test data, or an obvious/perfect causation is being found through the correlations. My use of train_test_split tells me that I'm not training on my test data, and I guarantee the second is false because of how messy this space is (we started doing manual linear regression on this data about 15 years ago, and still maintain Excel spreadsheets to be able to do it manually in a pinch, even if it's significantly less accurate than our Azure ML Studio models).

Let's look at the code. Here are the relevant parts of my Jupyter notebook (sorry if there's a better way to format this):

X = myData
y = myData.ValueToPredict
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.75,
    test_size = 0.25)
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test:  ", X_test.shape)
print("y_test:  ", y_test.shape)

X_train:  (300, 17)
y_train:  (300,)
X_test:   (101, 17)
y_test:   (101,)

ESTIMATORS = {
    "Extra Trees": ExtraTreesRegressor(criterion = "mse",
                                       n_estimators=10,
                                       max_features=16,
                                       random_state=42),
    "Decision Tree": DecisionTreeRegressor(criterion = "mse",
                                       splitter = "best",
                                       random_state=42),
    "Random Forest": RandomForestRegressor(criterion = "mse",
                                       random_state=42),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}

y_test_predict = dict()
y_test_rmse = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)
    y_test_rmse[name] = np.sqrt(np.mean((y_test - y_test_predict[name]) ** 2)) # I think this might be wrong but isn't the source of my problem
for name, error in y_test_rmse.items():
    print(name + " RMSE: " + str(error))

Extra Trees RMSE: 0.384354083868615157
Decision Tree RMSE: 0.32838969545222946
Random Forest RMSE: 0.4304701784728594
Linear regression RMSE: 7.971345895791494e-15
Ridge RMSE: 0.0001390197344951183
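Regarding the comment on the RMSE line in the loop above: one way to cross-check a hand-written RMSE is to compare it with the square root of scikit-learn's mean_squared_error. This is a minimal sketch; the arrays y_true and y_pred are made-up values for illustration, not the question's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical values for illustration only (not the question's data).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Hand-written RMSE, using the same formula as the loop above.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity via scikit-learn's MSE helper.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print("manual: ", rmse_manual)
print("sklearn:", rmse_sklearn)
```

If the two numbers agree, the formula is not where the problem lies.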

y_test_score = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)
    y_test_score[name] = estimator.score(X_test, y_test)
for name, error in y_test_score.items():
    print(name + " Score: " + str(error))

Extra Trees Score: 0.9990166492769291
Decision Tree Score: 0.999282165241745
Random Forest Score: 0.998766521504593
Linear regression Score: 1.0
Ridge Score: 0.9999999998713534
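For scikit-learn regressors, score returns the coefficient of determination R^2 on the data passed in, so a value of 1.0 means the test targets were reproduced exactly. The sketch below checks that against r2_score and the textbook formula. The data is synthetic and the names X_demo, y_demo and the coefficients are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration only (hypothetical, not the question's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Fit on the first 75 rows, evaluate on the remaining 25.
model = LinearRegression().fit(X_demo[:75], y_demo[:75])
X_eval, y_eval = X_demo[75:], y_demo[75:]
pred = model.predict(X_eval)

score = model.score(X_eval, y_eval)
r2_helper = r2_score(y_eval, pred)
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum((y_eval - pred) ** 2) / np.sum((y_eval - y_eval.mean()) ** 2)

print(score, r2_helper, r2_manual)
```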

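I thought maybe I had the error metric wrong, so I also looked at the simple score (which is why I included both). However, both show that these predictions are too good to be true. Keep in mind that the volume of input is small (roughly 400 items in total). And the data this runs over is essentially predicting commodity consumption based on weather patterns, which is a messy space to begin with, so there should be a lot of error.

What am I doing wrong here?

(Also, if I could have asked this question in a better way or provided more useful information, I'd greatly appreciate hearing it!)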

Here is a heatmap of the data. I've marked the value we're predicting.

Seaborn heatmap of the data

I also plotted a couple of the more important inputs against the value we're predicting (color-coded by another dimension):

A plot of values we're predicting

Here is column 2, as requested in the comments:
Another plot

Solution!

As @jwil pointed out, I wasn't pulling the ValueToPredict column out of my X variable. The fix was a one-liner that removes that column:

X = myData
y = myData.ValueToPredict
X = X.drop("ValueToPredict", axis=1) # <--- ONE-LINE FIX!
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.75,
    test_size = 0.25)

With this in place, my errors and scores are much closer to what I expected:

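Extra Trees RMSE: 1.6170428819849574
Decision Tree RMSE: 1.990459810552763
Random Forest RMSE: 1.699801032532343
Linear regression RMSE: 2.5265108241534397
Ridge RMSE: 2.528721533965162
Extra Trees Score: 0.9825944193611161
Decision Tree Score: 0.9736274412836977
Random Forest Score: 0.9807672396970707
Linear regression Score: 0.9575098985510281
Ridge Score: 0.9574355079097321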
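The effect can be reproduced on synthetic data. The sketch below is only an illustration: the column names temperature and humidity, the coefficients, and the helper holdout_r2 are all invented for this example, and only ValueToPredict comes from the notebook above. It fits the same linear model twice, once with the target left inside X and once with it dropped.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: two noisy weather-like inputs and a target.
rng = np.random.default_rng(42)
n = 400
data = pd.DataFrame({
    "temperature": rng.normal(15, 10, n),
    "humidity": rng.uniform(20, 90, n),
})
data["ValueToPredict"] = (
    0.8 * data["temperature"] - 0.1 * data["humidity"] + rng.normal(0, 5, n)
)


def holdout_r2(X, y):
    """Fit a linear model on 75% of the rows and return R^2 on the other 25%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)


y = data["ValueToPredict"]

# Leaky: X still contains the column being predicted.
leaky_r2 = holdout_r2(data, y)

# Fixed: the target column is dropped from X first.
X_clean = data.drop(columns="ValueToPredict")
clean_r2 = holdout_r2(X_clean, y)

print("R^2 with target left in X:", leaky_r2)
print("R^2 with target dropped:  ", clean_r2)

# A cheap guard worth running before any fit.
assert "ValueToPredict" not in X_clean.columns
```

With the target left in X, the linear model can simply copy that column, which matches the 1.0 score and the RMSE near 1e-15 seen earlier.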

1 Answer

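You're right; I strongly suspect that one or more features in your X data are almost perfectly correlated with the Y data. Usually this is bad, because such variables don't explain Y but are either explained by Y or jointly determined with Y. To troubleshoot this, consider running a linear regression of Y on X and then using simple p-values or AIC/BIC to determine which X variables are the least relevant. Drop these and repeat the process until your R^2 begins to drop seriously (it will drop a little each time). The remaining variables will be the most relevant for prediction, and hopefully you will be able to identify from that subset which variables are so tightly correlated with Y.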
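The elimination loop described above can be sketched roughly as follows. Note the substitution: instead of p-values or AIC/BIC, this version drops, at each step, the feature whose removal costs the least R^2, which keeps the example NumPy-only. The data is synthetic, and the feature names (leak, signal, noise_a, noise_b) and the helpers r_squared and backward_elimination are hypothetical.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X, with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def backward_elimination(X, y, names):
    """Repeatedly drop the feature whose removal lowers R^2 the least."""
    remaining = list(range(X.shape[1]))
    path = [("(all features)", r_squared(X, y))]
    while len(remaining) > 1:
        # R^2 of the model with each candidate feature left out in turn.
        trials = {
            j: r_squared(X[:, [k for k in remaining if k != j]], y)
            for j in remaining
        }
        drop = max(trials, key=trials.get)
        remaining.remove(drop)
        path.append((names[drop], trials[drop]))
    return path, [names[k] for k in remaining]


# Synthetic data: one real signal, two pure-noise columns, one near-copy of y.
rng = np.random.default_rng(1)
n = 400
signal = rng.normal(size=n)
y = 2.0 * signal + rng.normal(size=n)
leak = y + rng.normal(scale=0.01, size=n)
X = np.column_stack([leak, signal, rng.normal(size=n), rng.normal(size=n)])
names = ["leak", "signal", "noise_a", "noise_b"]

path, survivors = backward_elimination(X, y, names)
for dropped, r2 in path:
    print(f"dropped {dropped:>15}: R^2 = {r2:.6f}")
print("last feature standing:", survivors)
```

In this toy setup R^2 barely moves as the other columns are removed, and the last surviving feature is the one that is essentially a copy of Y, which is the kind of variable the procedure is meant to expose.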

Answered 2024-05-06T07:18:10+08:00

scikit-learn regression prediction results are too good. What did I mess up?
