Python, machine learning - Performing a grid search on a custom validation set

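I am dealing with an imbalanced classification problem in which the negative class is 1000 times more numerous than the positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an imbalanced (1/1000 ratio) validation set to select the best model and optimize the hyperparameters.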

Since the number of parameters is large, I want to use scikit-learn's RandomizedSearchCV, i.e. a randomized grid search.

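As far as I understand, sk-learn GridSearch applies the metric on the training set to select the best set of hyperparameters. In my case, however, this means GridSearch will select the model that performs best on the balanced training set, rather than on the more realistic imbalanced data.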

My question is: is there a way to run a grid search with performance estimated on a specific, user-defined validation set?

1 Answer

As suggested in the comments, what you need is PredefinedSplit. It is described in the question here.

As for how it works, you can look at the example given in the documentation:

import numpy as np
from sklearn.model_selection import PredefinedSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# This is what you need: one fold label per sample, -1 = never in a test set
test_fold = [0, 1, -1, 1]

ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())
# OUTPUT
# 2

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# OUTPUT
# TRAIN: [1 2 3] TEST: [0]
# TRAIN: [0 2] TEST: [1 3]

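As you can see here, you pass test_fold a list with one fold index per sample, and those indices are used to split the data. A value of -1 marks a sample that is never included in a validation set, so it always stays in the training set.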

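So in the code above, test_fold = [0, 1, -1, 1] means that the first validation set (the samples whose value in test_fold is 0) contains index 0, and the second validation set is where test_fold has the value 1, i.e. indices 1 and 3.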
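To double-check that reading of test_fold, here is a small sketch (assuming NumPy and scikit-learn are installed; the variable names splits and expected_test are only for illustration). It recomputes each validation set directly from the fold labels and compares it with what PredefinedSplit yields.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.array([0, 1, -1, 1])
ps = PredefinedSplit(test_fold)

# Materialize the (train_index, test_index) pairs, one per fold label >= 0.
splits = list(ps.split())

for fold_id, (train_index, test_index) in enumerate(splits):
    # The validation set of fold k is every position where test_fold == k.
    expected_test = np.flatnonzero(test_fold == fold_id)
    assert np.array_equal(test_index, expected_test)
    # Everything else, including the -1 sample at position 2, is training data.
    assert 2 in train_index and 2 not in test_index
```

The sample labeled -1 ends up in every training set and in no validation set, which is the behavior the single-validation-set trick relies on.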

However, if you already have a separate X_train and X_test, and you want only X_test to be used as the validation set, then you need to do the following:

import numpy as np
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

my_test_fold = []

# put -1 here, so they will be in training set
for i in range(len(X_train)):
    my_test_fold.append(-1)

# for all greater indices, assign 0, so they will be put in test set
for i in range(len(X_test)):
    my_test_fold.append(0)

# Combine X_train and X_test into one array; the row order must match my_test_fold
clf = RandomizedSearchCV( ...    cv = PredefinedSplit(test_fold=my_test_fold))
clf.fit(np.concatenate((X_train, X_test), axis=0), np.concatenate((y_train, y_test), axis=0))
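For a version that runs end to end, here is a minimal sketch under several assumptions that are not in the question: the data is a toy pair of Gaussian blobs from a hypothetical make_data helper, a LogisticRegression stands in for the deep neural network, the validation set uses a 1/100 ratio to stay small, and "average_precision" is just one possible metric for imbalanced data. The fold labels are built with NumPy instead of the two loops, which gives the same array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

rng = np.random.default_rng(0)


def make_data(n_neg, n_pos):
    """Hypothetical toy data: negatives around 0.0, positives around 1.5."""
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(n_neg, 2)),
        rng.normal(1.5, 1.0, size=(n_pos, 2)),
    ])
    y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return X, y


X_train, y_train = make_data(200, 200)  # balanced training set
X_val, y_val = make_data(2000, 20)      # imbalanced validation set

# -1 -> always in the training set; 0 -> the single validation fold.
test_fold = np.concatenate([
    np.full(len(X_train), -1),
    np.zeros(len(X_val), dtype=int),
])
cv = PredefinedSplit(test_fold)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),  # stand-in for the real model
    param_distributions={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    n_iter=4,
    scoring="average_precision",  # scored on the validation fold only
    cv=cv,
    refit=False,  # do not refit on train + validation rows
    random_state=0,
)
# Row order must match test_fold: training rows first, then validation rows.
search.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

print(search.best_params_, search.best_score_)
```

One design choice to be aware of: with the default refit=True, the best estimator is refit on everything passed to fit, which here would include the validation rows. Setting refit=False avoids that, at the cost of having to retrain the chosen configuration on X_train yourself.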

Answered on 2024-04-26T22:49:04+08:00
