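Passing the sample_weight parameter to GridSearchCV raises an error because of an incorrect shape. I suspect that cross-validation does not split sample_weight in step with the dataset.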

Part 1: Using sample_weight as a model parameter works fine

Let's consider a simple example, first without GridSearch:

```
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'

# Read the three columns in a single pass over the CSV
data = pd.read_csv(dataURL, usecols=["x", "y", "Occurrences"])
x = data.x
y = data.y
occurrences = data.Occurrences

# Per-observation weights derived from the Occurrences column
my_sample_weights = (1 - occurrences/10000)**3
```

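my_sample_weights contains the importance I assign to each observation in x, y, as shown in the figure below. Points on the sinusoidal curve get a higher weight than the points that form the background noise.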
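To make the weighting formula concrete, here is a small sketch that evaluates it for a few occurrence counts. The helper name weight_from_occurrences and the sample counts are made up for illustration; only the formula itself and the 0.9 threshold used in the plot come from the question.

```python
def weight_from_occurrences(occurrences, scale=10000, power=3):
    """Same formula as my_sample_weights: (1 - occurrences/scale)**power."""
    return (1 - occurrences / scale) ** power


# Illustrative occurrence counts (not taken from the dataset)
for occ in (0, 100, 1000, 5000):
    w = weight_from_occurrences(occ)
    # The plot colors points by whether their weight exceeds 0.9
    print(occ, round(w, 4), w > 0.9)
```

Because of the cubic power, the weight drops off quickly: only observations with a small occurrence count stay above the 0.9 threshold.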

```
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
```

Color coded dataset with respect to my_sample_weights

Let's train a neural network, first without using the information contained in my_sample_weights:

```
def make_model(number_of_hidden_neurons=1):
    # One tanh hidden layer followed by a single linear output unit
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model

# Fit without sample weights: every observation counts equally
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x, y, epochs=1000)

# Plot the data and the network's predictions (green)
plt.scatter(x, y)
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
```

As the following figure shows, the neural network tries to fit the shape of the sinusoid, but the background noise prevents a good fit.

Now, using the information in my_sample_weights, the quality of the prediction is much better.
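Why the weights help can be sketched with a weighted mean squared error. This is only a conceptual illustration: the function weighted_mse and the toy numbers are hypothetical, and dividing by the sum of the weights is one common convention that may differ from the exact normalization Keras applies internally.

```python
def weighted_mse(y_true, y_pred, sample_weight):
    """Weighted MSE: each squared residual is scaled by its sample weight."""
    total = sum(w * (t - p) ** 2
                for t, p, w in zip(y_true, y_pred, sample_weight))
    return total / sum(sample_weight)


# Toy data: the last target plays the role of a background-noise point
y_true = [0.0, 1.0, 5.0]
y_pred = [0.0, 1.0, 1.0]

# Equal weights: the noisy point dominates the loss
uniform_loss = weighted_mse(y_true, y_pred, [1.0, 1.0, 1.0])

# Down-weighting the noisy point shrinks its contribution
downweighted_loss = weighted_mse(y_true, y_pred, [1.0, 1.0, 0.1])

print(uniform_loss, downweighted_loss)
```

With a low weight on the noisy point, the optimizer has less incentive to bend the curve toward it, which is consistent with the better fit described above.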

Part 2: Using sample_weight as a GridSearchCV parameter raises an error

```
my_Regressor = KerasRegressor(make_model)

validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500],
                                    },
                         fit_params={'sample_weight': [ my_sample_weights ]},
                         n_jobs=1,
                        )
validator.fit(x, y)
```

Trying to pass sample_weight as a parameter produces the following error:

```
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
```

It seems that the sample_weight vector is not split in the same way as the input array.
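What an aligned split would look like can be sketched with a manual cross-validation loop. This is a minimal illustration, not a fix for the Keras wrapper itself: the data are synthetic stand-ins for x, y and my_sample_weights, and Ridge is used only because it is a scikit-learn estimator whose fit method accepts sample_weight.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-ins (assumptions, not the question's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=1000)
weights = rng.uniform(size=1000)

train_sizes = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    model = Ridge(alpha=1.0)
    # Index X, y and the weights with the same training indices
    fold_weights = weights[train_idx]
    model.fit(X[train_idx], y[train_idx], sample_weight=fold_weights)
    train_sizes.append((len(train_idx), len(fold_weights)))

# Each pair holds (training rows, training weights) for one fold
print(train_sizes)
```

Each fold then receives a weight vector of the same length as its training rows, which is the condition the error message says was violated. As far as I know, more recent scikit-learn releases expect fit parameters to be passed to GridSearchCV.fit rather than to the constructor.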

For what it's worth:

```
import sklearn
print(sklearn.__version__)  # 0.18.1

import keras
print(keras.__version__)  # 2.0.5
```

2 Answers

1
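The problem is that, by default in this scikit-learn version, GridSearchCV uses 3-fold cross-validation unless told otherwise. This means that 2/3 of the data points are used as training data and 1/3 for validation, which matches the error message. The sample_weight passed through fit_params has shape (1000,), which does not match the number of training examples in a fold (666). Resize it and the code will run.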
```
my_sample_weights = np.random.uniform(size=666)
```
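A caveat on the resize suggestion, sketched under the assumption that the folds follow scikit-learn's unshuffled KFold convention (the first n_samples % n_splits test folds get one extra sample): the training size is not the same for every fold, so a fixed length of 666 can only match some of them. Replacing the weights with random values also discards the information in the original my_sample_weights. The helper kfold_train_sizes is hypothetical, written just for this illustration.

```python
def kfold_train_sizes(n_samples, n_splits):
    """Training-set size per fold for an unshuffled k-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` test folds receive one additional sample
    test_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [n_samples - t for t in test_sizes]


sizes = kfold_train_sizes(1000, 3)
print(sizes)
```

Slicing the original weights with each fold's training indices keeps them the right length for every fold, whatever the fold sizes turn out to be.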
Answered 2024-04-28T18:08:03+08:00
1

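We developed PipeGraph, an extension of the Scikit-Learn Pipeline that lets you access intermediate data, build workflow-like graphs and, in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph)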

Answered 2024-04-28T18:08:03+08:00

Wrong shape for the sample_weight parameter in scikit-learn GridSearchCV
