
Cross-validation with Scikit-Learn GridSearchCV and PredefinedSplit - suspiciously good cross-validation results


I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross-validation error using a predefined development and validation split (i.e. 1-fold cross-validation).

I'm afraid I'm doing something wrong, because my validation accuracy is suspiciously high. As I understand it, I split my training data into a development set and a validation set, train on the development set, and record the cross-validation score on the validation set. My accuracy is probably inflated because I am actually training on a mix of the development and validation sets and then testing on the validation set. I'm not sure I'm using scikit-learn's PredefinedSplit module correctly. Details below:

Following this answer, I did the following:

    import numpy as np
    from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV

    # I split up my data into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        data[training_features], data[training_response], test_size=0.2, random_state=550)

    # sanity check - dimensions of training and test splits
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)

    # dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
    # dimensions of X_test and y_test are (80858, 26) and (80858, 1)

    ''' Now, I define indices for a pre-defined split.
    This is a 323430-element array, where the entries for the development
    set are set to -1, and the entries for the validation set are set to 0.'''

    validation_idx = np.repeat(-1, y_train.shape[0])
    np.random.seed(550)
    validation_idx[np.random.choice(validation_idx.shape[0],
           int(round(.2 * validation_idx.shape[0])), replace=False)] = 0

    # Now, create a list which contains a single tuple of two elements,
    # which are arrays containing the indices for the development and
    # validation sets, respectively.
    validation_split = list(PredefinedSplit(validation_idx).split())

    # sanity check
    print(len(validation_split[0][0]))  # outputs 258744
    print(len(validation_split[0][0]) / float(validation_idx.shape[0]))  # outputs .8
    print(validation_idx.shape[0] == y_train.shape[0])  # True
    print(set(validation_split[0][0]).intersection(set(validation_split[0][1])))  # set([])
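
As a side note, my understanding of PredefinedSplit is that samples whose test_fold entry is -1 are never placed in a test fold, so an array of 0s and -1s like the one above should yield exactly one (development, validation) pair. A toy example of that behaviour (unrelated to my data):

    import numpy as np
    from sklearn.model_selection import PredefinedSplit

    # toy test_fold array: samples 0-2 belong to the development set
    # (-1, i.e. excluded from every test fold), samples 3-4 to the
    # validation set (fold 0)
    toy_fold = np.array([-1, -1, -1, 0, 0])
    for dev_idx, val_idx in PredefinedSplit(toy_fold).split():
        print(dev_idx, val_idx)   # [0 1 2] [3 4]  -> exactly one split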

Now, I run the grid search using GridSearchCV. My intention is that, for each parameter combination in the grid, a model is fit on the development set, and the cross-validation score is recorded when the resulting estimator is applied to the validation set.

    from xgboost import XGBClassifier

    # a vanilla XGBoost model
    model1 = XGBClassifier()

    # create a parameter grid for the number of trees and depth of trees
    n_estimators = range(300, 1100, 100)
    max_depth = [8, 10]
    param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

    # A grid search.
    # NOTE: I'm passing the (development, validation) split produced by
    # PredefinedSplit as the `cv` argument.
    grid_search = GridSearchCV(model1, param_grid,
           scoring='neg_log_loss',
           n_jobs=-1,
           cv=validation_split,
           verbose=1)

    # fit the grid search; the fitted search object is what I refer to
    # as grid_result2 below
    grid_result2 = grid_search.fit(X_train, y_train)

Now, here is where red flags get raised for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It is very high: 0.89207865689639176. What's worse is that it is almost identical to the accuracy I get when I apply the classifier to the development set (which it was just trained on): 0.89295597192591902. BUT, when I apply the classifier to the true test set, I get a much lower accuracy, roughly .78:

    from sklearn.metrics import accuracy_score

    # accuracy score on the validation set. This yields .89207865
    accuracy_score(y_pred=
           grid_result2.predict(X_train.iloc[validation_split[0][1]]),
           y_true=y_train.iloc[validation_split[0][1]])

    # accuracy score when applied to the development set. This yields .8929559
    accuracy_score(y_pred=
           grid_result2.predict(X_train.iloc[validation_split[0][0]]),
           y_true=y_train.iloc[validation_split[0][0]])

    # finally, the score when applied to the test set. This yields .783
    accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)

To me, the near-identical accuracies on the development and validation sets, combined with the significant drop in accuracy on the test set, are a clear sign that I am accidentally training on the validation data, and that my cross-validation score is therefore not representative of the model's true accuracy.

I can't seem to figure out where I went wrong, mostly because I don't know what GridSearchCV is doing under the hood when it receives the split produced by PredefinedSplit as its cv argument.
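
For what it's worth, my rough mental model of what GridSearchCV does with an iterable of (train, test) index pairs like validation_split is the sketch below (rough_grid_search and candidate_params are made-up names, and this is obviously not the actual scikit-learn implementation, otherwise I wouldn't be seeing this behaviour):

    # Rough sketch of my mental model, NOT the real scikit-learn source:
    # for every parameter combination, fit a clone of the estimator on the
    # train indices and score it on the test indices of each pair in `cv`.
    from sklearn.base import clone

    def rough_grid_search(estimator, candidate_params, X, y, splits):
        scores = []
        for params in candidate_params:        # e.g. [{'max_depth': 8, 'n_estimators': 300}, ...]
            for dev_idx, val_idx in splits:    # here: the single pair in validation_split
                est = clone(estimator).set_params(**params)
                est.fit(X.iloc[dev_idx], y.iloc[dev_idx])
                scores.append((params, est.score(X.iloc[val_idx], y.iloc[val_idx])))
        return scores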

Any ideas about where I went wrong? Let me know if you need more details or elaboration. The code is also in this notebook on github.

Thanks!

1 Answer


You need to set refit=False (it is not the default option), otherwise the grid search will refit the estimator on the whole dataset (ignoring cv) after the search has finished.
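
For illustration, a minimal sketch of that change, reusing the names from your question (the manual refit at the end is just one way to get a final model trained on the development fold only; I haven't run this on your data):

    # Same grid search as in the question, but with refit=False, so that
    # GridSearchCV does NOT refit the best parameter combination on all of
    # X_train (development + validation folds) after the search finishes.
    grid_search = GridSearchCV(model1, param_grid,
                               scoring='neg_log_loss',
                               n_jobs=-1,
                               cv=validation_split,
                               refit=False,
                               verbose=1)
    grid_result2 = grid_search.fit(X_train, y_train)

    # With refit=False there is no best_estimator_ to predict with; the
    # search results live in cv_results_ / best_params_. To get a model
    # that has only seen the development fold, refit it manually:
    dev_idx = validation_split[0][0]
    best_model = XGBClassifier(**grid_result2.best_params_)
    best_model.fit(X_train.iloc[dev_idx], y_train.iloc[dev_idx])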
