Python Hyperparameter Optimization for XGBClassifier Using RandomizedSearchCV

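I am trying to find the best hyperparameters for XGBClassifier, which should in turn identify the most predictive attributes. I am trying to use RandomizedSearchCV to iterate, validating with KFold.

Since I run this process 5 times in total (numFolds = 5), I want the best results saved in a dataframe called collector (specified below). So on each iteration, I want the best result and its score appended to the collector dataframe.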

from scipy import stats
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score,recall_score,accuracy_score,f1_score,roc_auc_score

clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.9),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.9),
              'min_child_weight': [1, 2, 3, 4]
             }
clf = RandomizedSearchCV(clf_xgb, param_distributions = param_dist, n_iter = 25, scoring = 'roc_auc', error_score = 0, verbose = 3, n_jobs = -1)

numFolds = 5
folds = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)

collector = pd.DataFrame()
estimators = []
results = np.zeros(len(X))
score = 0.0

for train_index, test_index in folds:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    estimators.append(clf.best_estimator_)
    estcoll = pd.DataFrame(estimators)


    estcoll['score'] = score
    pd.concat([collector,estcoll])
    print "\n", len(collector), "\n"
score /= numFolds

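For some reason, nothing is being saved to the dataframe. Please help.

Also, I have about 350 attributes to loop through, with 3.5K rows in train and 2K in test. Could running this through a Bayesian hyperparameter optimization process improve my results? Or would it only save processing time?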

1 Answer

RandomizedSearchCV() will do more for you than you realize. Explore the `cv_results_` attribute of the fitted CV object on the documentation page.

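Here is your code, nearly unchanged. I made two changes:
- I changed n_iter from 25 to 5. This tries 5 sets of parameters, which with 5-fold cross-validation means 25 fits in total.
- I defined your kfold object before RandomizedSearchCV, and then passed it to the RandomizedSearchCV constructor as the cv param.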
```
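import xgboost as xgb
from scipy import stats
from sklearn.model_selection import KFold, RandomizedSearchCV

clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
# scipy's uniform(loc, scale) samples from [loc, loc + scale], so the second
# argument is the width of the range, not its upper bound. The subsample and
# colsample_bytree widths below assume 0.9 was meant as the upper bound.
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),     # samples from [0.01, 0.61]
              'subsample': stats.uniform(0.3, 0.6),          # samples from [0.3, 0.9]
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.4),   # samples from [0.5, 0.9]
              'min_child_weight': [1, 2, 3, 4]
             }

numFolds = 5
# KFold now lives in sklearn.model_selection and takes n_splits;
# the older cross_validation module has been removed from scikit-learn.
kfold_5 = KFold(n_splits = numFolds, shuffle = True)

clf = RandomizedSearchCV(clf_xgb,
                         param_distributions = param_dist,
                         cv = kfold_5,
                         n_iter = 5, # you want 5 here not 25 if I understand you correctly
                         scoring = 'roc_auc',
                         error_score = 0,
                         return_train_score = True, # needed for mean_train_score in cv_results_
                         verbose = 3,
                         n_jobs = -1)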
```
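One detail in `param_dist` that may be worth double-checking: `scipy.stats.uniform(loc, scale)` draws from the interval [loc, loc + scale], not [loc, scale]. The `stats.uniform(0.3, 0.9)` that the question uses for `subsample` can therefore return values up to 1.2, which is outside the (0, 1] range that parameter accepts. A minimal sketch of the difference follows. The variable names are chosen here for illustration, and it assumes 0.9 was intended as the upper bound:

```python
from scipy import stats

# scipy's uniform takes (loc, scale) and samples from [loc, loc + scale].
as_written = stats.uniform(0.3, 0.9)  # covers [0.3, 1.2]
intended = stats.uniform(0.3, 0.6)    # covers [0.3, 0.9], assuming 0.9 was meant as the upper bound

# ppf(0) and ppf(1) give the lower and upper ends of each distribution.
as_written_bounds = (as_written.ppf(0), as_written.ppf(1))
intended_bounds = (intended.ppf(0), intended.ppf(1))

samples = intended.rvs(size=1000, random_state=0)
print(as_written_bounds, intended_bounds, samples.min(), samples.max())
```

With `error_score = 0`, any fit that fails because of an out-of-range draw would silently score 0 instead of raising, so this kind of issue can be easy to miss.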
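This is where my answer deviates significantly from your code. Just fit the RandomizedSearchCV object once; no loop is needed. It handles the CV loop itself through its cv argument.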
```
clf.fit(X_train, y_train)
```
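All of your cross-validated results are now in `clf.cv_results_`. For example, you can get the cross-validated (mean across the 5 folds) train score with `clf.cv_results_['mean_train_score']`, or the cross-validated test-set (held-out data) score with `clf.cv_results_['mean_test_score']`. Note that in recent scikit-learn versions the train scores are only recorded if the search was constructed with `return_train_score=True`. You can also get other useful things like `mean_fit_time` and `params`, and `clf`, once fitted, will automatically remember your `best_estimator_` as an attribute.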
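The results described above can be pulled into a single DataFrame, which is roughly what the `collector` in the question was meant to hold. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the question's `X` and `y`, and scikit-learn's `GradientBoostingClassifier` as a stand-in estimator so that it runs without xgboost installed. The same pattern should apply to an `XGBClassifier` search.

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic data standing in for the question's X and y (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Stand-in estimator (assumption) so the sketch does not depend on xgboost.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": stats.randint(10, 30),
        "max_depth": [2, 3],
    },
    n_iter=5,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    return_train_score=True,  # needed for the mean_train_score column
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per sampled parameter set, with scores already averaged over the folds.
collector = pd.DataFrame(search.cv_results_)
summary = collector[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

Per-fold scores are also available in columns such as `split0_test_score`, in case the fold-level detail is needed.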

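These are relevant for determining the best set of hyperparameters for model fitting. A single hyperparameter set is held constant across the 5 folds used within one iteration from n_iter, so you do not have to reconcile differing scores between folds within an iteration.

Answered 2024-05-03T08:46:05+08:00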
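As for why `collector` stayed empty in the original loop, one likely reason is that `pd.concat` returns a new DataFrame rather than modifying its inputs, so the result has to be assigned back. A minimal sketch with made-up rows (the column names and values are illustrative only):

```python
import pandas as pd

collector = pd.DataFrame()
fold_result = pd.DataFrame({"fold": [0], "score": [0.5]})  # illustrative values only

# Result discarded: collector is unchanged.
pd.concat([collector, fold_result])
rows_without_assignment = len(collector)

# Result assigned back: collector now holds the row.
collector = pd.concat([collector, fold_result], ignore_index=True)
rows_with_assignment = len(collector)

print(rows_without_assignment, rows_with_assignment)  # prints: 0 1
```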
