Does the learning curve show overfitting?

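I want to know whether my (binary) classification model is overfitting, so I plotted its learning curve. The dataset has 6836 instances, 1006 of which belong to the positive class.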

1) If I use SMOTE to balance the classes and RandomForest as the technique, I get this curve and these rates: TPR = 0.887 and FPR = 0.041:

Learning curve 1

Note that the training error is flat and almost 0.

2) If I use the "balanced_subsample" function (attached at the end) to balance the classes and RandomForest as the technique, I get this curve and these rates: TPR = 0.866 and FPR = 0.14:

Learning curve 2

Note that in this case the test error is flat.

Is the model overfitting?
Which one makes more sense?

The "balanced_subsample" function:

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Group the rows of x by class label and track the smallest class size.
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Optionally keep only a fraction of the smallest class size.
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        # Shuffle the larger classes so the rows kept are a random subset.
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
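For readers who want to check what this kind of per-class undersampling does to the class counts, here is a minimal sketch. The helper name `undersample_per_class` and the synthetic arrays are assumptions made for illustration; only the counts (6836 instances, 1006 positives) come from the question.

```python
import numpy as np

def undersample_per_class(x, y, seed=0):
    """Hypothetical helper: keep the same number of rows from every class."""
    rng = np.random.default_rng(seed)
    classes, class_counts = np.unique(y, return_counts=True)
    n_keep = class_counts.min()  # size of the rarest class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return x[keep], y[keep]

# Synthetic stand-in with the class counts quoted in the question.
y = np.array([1] * 1006 + [0] * (6836 - 1006))
x = np.arange(len(y)).reshape(-1, 1)

xs, ys = undersample_per_class(x, y)
counts = np.bincount(ys)
print(counts)  # one entry per class, both equal to the minority size
```

With these counts, undersampling leaves 2012 rows in total, so a model trained this way sees far fewer examples than one trained on SMOTE-oversampled data.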

EDIT1: More info about the code and the process

X = data
y = X.pop('myclass')


# There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)

#Here I use some code to balance my class using SMOTE or "balanced_subsample" approach
X_train_balanced, y_train_balanced=mySMOTEfunc(arrX, y)
#X_train_balanced, y_train_balanced=balanced_subsample(arrX, y) 

#TRAIN/TEST SPLIT (STRATIFIED K_FOLD is implicit)
X_train,X_test,y_train,y_test = train_test_split(X_train_balanced,y_train_balanced,test_size=0.25)

#Estimator
clf=RandomForestClassifier(random_state=np.random.seed()) 
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}

#Grid search
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10)
start = time()
CV_clf.fit(X_train, y_train)

#FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
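The TPR and FPR values quoted above can be derived from `y_test` and `y_pred`. A minimal sketch follows, assuming the positive class is encoded as 1; the function name `tpr_fpr` and the toy label arrays are made up for illustration and are not the question's data.

```python
import numpy as np

def tpr_fpr(y_true, y_pred, positive=1):
    """Return (true positive rate, false positive rate) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    return tp / (tp + fn), fp / (fp + tn)

# Toy labels: 4 positives (3 found), 6 negatives (1 false alarm).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tpr, fpr = tpr_fpr(y_true, y_pred)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```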

EDIT2: In this case, I try it with Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization

X = data
y = X.pop('myclass')

# There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)

#FOR SCENARIO 3: Normalization
standardized_X = preprocessing.normalize(arrX)

#FOR SCENARIOS 2 and 3: Removing all but the k highest scoring features
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(standardized_X , y)

#Here I use some code to balance my class using SMOTE or "balanced_subsample" approach
X_train_balanced, y_train_balanced=mySMOTEfunc(arrX_features_selected , y)
#X_train_balanced, y_train_balanced=balanced_subsample(arrX_features_selected , y) 

#TRAIN/TEST SPLIT (STRATIFIED K_FOLD is implicit)
X_train,X_test,y_train,y_test = train_test_split(X_train_balanced,y_train_balanced,test_size=0.25)

#Estimator
clf=RandomForestClassifier(random_state=np.random.seed()) 
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}

#Grid search
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10)
start = time()
CV_clf.fit(X_train, y_train)

#FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)

The learning curves for the 3 proposed scenarios are:

Scenario 1:
scenario1

Scenario 2: GBC + SMOTE + feature selection
Learning curve, scenario 2

Scenario 3: GBC + SMOTE + feature selection + normalization
Learning curve, scenario 3

1 Answer


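So, your first curve makes sense. You would expect the test error to go down as you add training points. You would also expect a training error near 0 when the random forest's trees have no maximum depth and use 100% max samples. You are probably overfitting, but you are probably not going to do much better with RandomForests (or, depending on the dataset, with anything else).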
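To see the near-zero training error of an unbounded-depth forest for yourself, here is a minimal sketch using scikit-learn's `learning_curve`. The synthetic imbalanced data set from `make_classification` and all parameter values are assumptions standing in for the real data, so the numbers will differ from the curves in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic, imbalanced stand-in for the real data (an assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.85, 0.15], flip_y=0.0, random_state=0)

# Default forest: max_depth=None, so each tree grows until its leaves are pure.
clf = RandomForestClassifier(n_estimators=50, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    clf, X, y, train_sizes=np.linspace(0.25, 1.0, 4), cv=3, scoring="accuracy")

# Convert mean accuracy per training size into error rates.
train_err = 1.0 - train_scores.mean(axis=1)
test_err = 1.0 - test_scores.mean(axis=1)
for n, tr, te in zip(sizes, train_err, test_err):
    print(f"n={int(n):4d}  train error={tr:.3f}  test error={te:.3f}")
```

If the training error from a run like this is far from 0 on your own data, that points at the input or the balancing step rather than at the forest itself.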

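Your second curve doesn't make sense. You should again be getting a training error near 0, unless something completely wonky is going on (like a really broken input set). I can't see anything wrong with your code, and I ran your function; it seems to work fine. I can't help any further unless you post a full working example with code.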

Answered on 2024-05-03T11:27:26+08:00
