scikit-learn error: The least populated class in y has only 1 member

I'm trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I'm getting this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

However, every class has at least 15 samples. Why am I getting this error?

X is a pandas DataFrame representing the data points, and y is a pandas DataFrame with one column that contains the target variable.

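I can't post the original data because it is proprietary, but the problem should be fairly reproducible: create a random pandas DataFrame (X) with 1k rows x 500 columns, and a random pandas DataFrame (y) with the same number of rows as X (1k) that holds the target variable (a categorical label) for each row. The y DataFrame should contain several distinct categorical labels (e.g. 'class1', 'class2', ...), and each label should appear at least 15 times.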
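A minimal sketch of the reproduction data described above. The seed, the six label names, and the column name "label" are assumptions made for illustration, not taken from the original data, and whether the error actually appears with this data may depend on the installed scikit-learn version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed (assumption)

n_rows, n_cols = 1000, 500
labels = [f"class{i}" for i in range(1, 7)]  # hypothetical label names

# Random feature matrix: 1k rows x 500 columns
X = pd.DataFrame(rng.normal(size=(n_rows, n_cols)))

# Guarantee at least 15 occurrences of every label, then fill the rest at random
guaranteed = np.repeat(labels, 15)
filler = rng.choice(labels, size=n_rows - len(guaranteed))
target = np.concatenate([guaranteed, filler])
rng.shuffle(target)

# One-column DataFrame holding the target, as in the question
y = pd.DataFrame({"label": target})

print(X.shape, y.shape)
print(y["label"].value_counts())
```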

1 Answer

  • 2

    The problem was that train_test_split takes 2 arrays as input, but the y array is a one-column matrix. If I pass only the first column of y, it works.

    xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
      random_state=85, stratify=y.iloc[:,0])
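As a self-contained illustration of the call above, here is a small sketch that passes the label column as a 1-D Series. The stand-in data is hypothetical: it only mirrors the class counts shown in the question, and the column name "label" and the 5 random feature columns are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the class counts shown in the question
counts = {"M2": 38, "M1": 35, "M4": 29, "M5": 15, "M0": 15, "M3": 15}
labels = np.repeat(list(counts), list(counts.values()))

y = pd.DataFrame({"label": labels})  # one-column DataFrame, 2-D
X = pd.DataFrame(np.random.default_rng(0).normal(size=(len(y), 5)))

# Select the label column so that a 1-D Series is passed instead of a DataFrame
target = y.iloc[:, 0]

xtrain, xtest, ytrain, ytest = train_test_split(
    X, target, test_size=1/3, random_state=85, stratify=target
)

print(y.shape, target.shape)  # 2-D DataFrame vs. 1-D Series
print(ytest.value_counts())   # every class should be represented in the test set
```

Checking `y.shape` before splitting is a quick way to see whether a 2-D object is being passed where a 1-D array of labels is expected.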
    
