
Improving prediction scores by using the classifiers' confidence levels on individual instances


I am using three classifiers (RandomForestClassifier, KNearestNeighborClassifier, and an SVM classifier), shown below:

>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=41, shrinking=True,
  tol=0.001, verbose=False)

>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')

>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

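During training, RandomForestClassifier gave the best f1_score on its predictions, followed by KNearestNeighborClassifier and then the SVM classifier. Here are my X_train (standard-scaled values; ask me if you need to know how I obtained them) and y_train: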

>> X_train
array([[-0.11034393, -0.72380296,  0.15254572, ...,  0.4166148 ,
        -0.91095473, -0.91095295],
       [ 1.6817184 ,  0.40040944, -0.6770607 , ..., -0.2403781 ,
         0.02962478,  0.02962424],
       [ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
        -0.15848331, -0.15847739],
       ..., 
       [-1.18666853,  0.87297522,  0.47136779, ..., -0.19599824,
         0.72417473,  0.72416714],
       [ 1.6835304 ,  0.40605067, -0.63383059, ..., -0.37094083,
         0.09505496,  0.09505389],
       [ 0.19950709, -1.04624152, -0.18351693, ...,  0.4362658 ,
        -0.77994791, -0.77994176]])

>> y_train_sl
874     0
1863    0
1493    0
288     1
260     0
495     0
1529    0
1704    1
75      1
1792    0
626     0
99      1
222     0
774     0
52      1
1688    1
1770    0
53      1
1814    0
488     0
230     0
481     0
132     1
831     0
1166    1
1593    0
771     0
1785    0
616     0
207     0
       ..
155     1
1506    0
719     0
547     0
613     0
652     0
1351    0
304     0
1689    1
1693    1
1128    0
1323    0
763     0
701     0
467     0
917     0
329     0
375     0
1721    0
928     0
1784    0
1200    0
832     0
986     0
1687    1
643     0
802     0
280     1
1864    0
1045    0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8

As you can see, my y_train is in boolean form (i.e., 1 where the instance is True and 0 where it is False).

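I want to further improve prediction accuracy by using predict_proba. When the prediction from one classifier (say RandomForestClassifier first) has low confidence (<60%) for particular instances (which I first need to identify), I want to move on to the next classifier (say KNearestNeighborClassifier) and check its confidence on those same instances. If it has high confidence (>60%), I accept that classifier's prediction. If its confidence on those instances is still low (<60%), I move on to the third classifier and do the same.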

Finally, if the third classifier's confidence is also low (<60%), I need to accept the prediction from whichever of the three classifiers has the highest confidence.

Since I am new to machine learning, some of my statements may be confusing, for which I apologize. Please correct me where I am wrong.

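EDIT: X_test and y_test are shown below. I need to predict on X_test_prepared and evaluate the predictions against y_test_sl using f1_score. The predicted y must be passed through all three classifiers and have the best confidence for every instance.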

>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
         0.32099948,  0.32099952],
       [ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
        -0.54261829, -0.54261947],
       [ 1.67447042,  0.24530384, -1.0113221 , ..., -0.54844942,
        -0.26066608, -0.26066032],
       ...,
       [ 0.28104683,  1.52670909,  0.62653301, ..., -1.15596295,
         2.05859487,  2.05859247],
       [ 1.50595496,  0.84507934, -0.44109634, ..., -0.71277072,
         0.14474518,  0.14474398],
       [-1.63423112, -0.12690448,  0.48577783, ..., -0.36025459,
         0.29137477,  0.29137047]])

>> y_test_sl
1321    0
1433    0
1859    0
1496    0
492     0
736     0
996     0
1001    0
634     0
1486    0
910     0
1579    0
373     0
1750    0
1563    0
1584    0
51      1
349     0
1162    1
594     0
1121    0
1637    0
1116    0
106     1
1533    0
993     0
960     0
277     0
142     1
1010    0
       ..
1104    1
1404    0
1646    0
1009    0
61      1
444     0
10      1
704     0
744     0
418     0
998     0
740     0
465     0
97      1
1550    1
1738    0
978     0
690     0
1071    0
1228    1
1539    0
145     1
1015    0
1371    0
1758    0
315     0
71      1
1090    0
1766    0
33      1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8

1 Answer

  • 0

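    The goal here is to build an ensemble of classifiers and take the most "confident" prediction (the class with the highest probability) across all of them. The code is as follows: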

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    import numpy as np
    from sklearn.datasets import make_classification
    
    X_train, y_train = make_classification(n_features=4) # Put your training data here instead
    
    # parameters for random forest
    rfclf_params = {
        'bootstrap': True, 
        'class_weight':None, 
        'criterion':'entropy',
        'max_depth':None, 
        'max_features':'sqrt',  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
        # ... fill in the rest you want here
    }
    
    # Fill in svm params here
    svm_params = {
        'probability':True
    }
    
    # KNeighbors params go here
    kneighbors_params = {
    
    }
    
    params = [rfclf_params, svm_params, kneighbors_params]
    classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]
    
    def ensemble(classifiers, params, X_train, y_train, X_test):
        classes = np.unique(y_train)
        # One column per class instead of a hardcoded 2, so multiclass targets also work
        best_preds = np.zeros((len(X_test), len(classes)))
    
        for i in range(len(classifiers)):
            # Construct the classifier by unpacking params 
            # store classifier instance
            clf = classifiers[i](**params[i])
            # Fit the classifier as usual and call predict_proba
            clf.fit(X_train, y_train)
            y_preds = clf.predict_proba(X_test)
            # Take maximum probability for each class on each classifier 
            # This is done for every instance in X_test
            # see the docs of np.maximum here: 
            # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
            best_preds = np.maximum(best_preds, y_preds)
    
        # map the maximum probability for each instance back to its corresponding class
        preds = np.array([classes[np.argmax(pred)] for pred in best_preds])
        return preds
    
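    # Test your predictions
    # NOTE: this scores on the training data only to show the call; use held-out data
    # (e.g. X_test_prepared / y_test_sl from the question) for a meaningful score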
    from sklearn.metrics import accuracy_score, f1_score
    y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
    print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))
    
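To see what the np.maximum step in the loop accumulates, here is a tiny worked example. The probability values are made up for illustration and do not come from the question's data. Note that the rows of the merged array no longer sum to 1; they are only used to find the most confident class.

```python
import numpy as np

# Made-up predict_proba outputs for 3 instances and 2 classes
# (columns: class 0, class 1). These numbers are illustrative only.
proba_a = np.array([[0.55, 0.45], [0.20, 0.80], [0.50, 0.50]])
proba_b = np.array([[0.30, 0.70], [0.40, 0.60], [0.90, 0.10]])

# Element-wise maximum: for each instance and class, keep the larger probability.
best = np.maximum(proba_a, proba_b)

# The predicted class is the column holding the largest retained probability,
# and that probability is the confidence of the winning classifier.
preds = best.argmax(axis=1)
confidence = best.max(axis=1)

print(best)
print(preds, confidence)
```

For the first instance, classifier A is only 55% sure of class 0 while classifier B is 70% sure of class 1, so class 1 wins.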

    If you want the algorithm to return the highest probability rather than the predicted class, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] instead of preds.
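The ensemble above always takes the most confident classifier, whereas the question describes a cascade: accept the first classifier whose confidence reaches 60%, and fall back to the most confident one only if none does. A minimal sketch of that variant follows. The helper name cascade_predict and the demo probabilities are hypothetical, and treating exactly 60% as "confident" is an assumption, since the question only says <60% and >60%.

```python
import numpy as np

def cascade_predict(probas, classes, threshold=0.6):
    """Hypothetical helper: threshold cascade over classifiers.

    probas:  list of (n_samples, n_classes) arrays from predict_proba,
             ordered by priority (first classifier is tried first).
    classes: class labels matching the columns (e.g. clf.classes_).
    Returns (labels, confidence, index of the classifier used).
    """
    probas = [np.asarray(p, dtype=float) for p in probas]
    classes = np.asarray(classes)
    n_samples = probas[0].shape[0]

    # -1 means "no classifier accepted yet" for that instance.
    chosen = np.full(n_samples, -1)
    for k, p in enumerate(probas):
        confident = p.max(axis=1) >= threshold
        chosen[confident & (chosen == -1)] = k

    # Fallback: nobody reached the threshold, so use the most confident classifier.
    conf = np.stack([p.max(axis=1) for p in probas], axis=1)
    undecided = chosen == -1
    chosen[undecided] = conf[undecided].argmax(axis=1)

    # Pick, per instance, the probability row of the chosen classifier.
    stacked = np.stack(probas, axis=0)
    picked = stacked[chosen, np.arange(n_samples)]
    return classes[picked.argmax(axis=1)], picked.max(axis=1), chosen

# Made-up probabilities for 4 instances, in priority order (illustrative only).
rf = [[0.90, 0.10], [0.55, 0.45], [0.50, 0.50], [0.45, 0.55]]
knn = [[0.20, 0.80], [0.30, 0.70], [0.58, 0.42], [0.52, 0.48]]
svm = [[0.50, 0.50], [0.90, 0.10], [0.41, 0.59], [0.20, 0.80]]

labels, confidence, used = cascade_predict([rf, knn, svm], classes=[0, 1])
print(labels, confidence, used)
```

With the question's fitted models, the input list could be built as [clf.predict_proba(X_test_prepared) for clf in (for_clf_sl_GS, knn_clf_sl_GS, svm_clf_sl_GS)], and the returned labels passed to f1_score(y_test_sl, labels). This relies on all three classifiers having been fitted on the same labels, so that their probability columns are in the same class order.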
