Improving prediction scores by using classifiers' confidence levels on individual instances

I am using three classifiers (RandomForestClassifier, KNearestNeighborClassifier, and SVM Classifier), shown below:

>> svm_clf_sl_GS
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=41, shrinking=True,
  tol=0.001, verbose=False)

>> knn_clf_sl_GS
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='distance')

>> for_clf_sl_GS
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

During training, RandomForestClassifier gave the best f1_score on the data, followed by KNearestNeighborClassifier and then SVMClassifier. Here are my X_train (standard-scaled values; ask me if you need to know how I obtained them) and y_train:

>> X_train
array([[-0.11034393, -0.72380296,  0.15254572, ...,  0.4166148 ,
        -0.91095473, -0.91095295],
       [ 1.6817184 ,  0.40040944, -0.6770607 , ..., -0.2403781 ,
         0.02962478,  0.02962424],
       [ 1.01128052, -0.21062032, -0.2460462 , ..., -0.04817728,
        -0.15848331, -0.15847739],
       ..., 
       [-1.18666853,  0.87297522,  0.47136779, ..., -0.19599824,
         0.72417473,  0.72416714],
       [ 1.6835304 ,  0.40605067, -0.63383059, ..., -0.37094083,
         0.09505496,  0.09505389],
       [ 0.19950709, -1.04624152, -0.18351693, ...,  0.4362658 ,
        -0.77994791, -0.77994176]])

>> y_train_sl
874     0
1863    0
1493    0
288     1
260     0
495     0
1529    0
1704    1
75      1
1792    0
626     0
99      1
222     0
774     0
52      1
1688    1
1770    0
53      1
1814    0
488     0
230     0
481     0
132     1
831     0
1166    1
1593    0
771     0
1785    0
616     0
207     0
       ..
155     1
1506    0
719     0
547     0
613     0
652     0
1351    0
304     0
1689    1
1693    1
1128    0
1323    0
763     0
701     0
467     0
917     0
329     0
375     0
1721    0
928     0
1784    0
1200    0
832     0
986     0
1687    1
643     0
802     0
280     1
1864    0
1045    0
Name: Type of Formation_shaly limestone, Length: 1390, dtype: uint8

As you can see, my y_train is in Boolean form (1 where the instance is True, 0 where it is False).

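I would like to improve prediction accuracy further by using predict_proba. When the first classifier (say, RandomForestClassifier) predicts an instance with low confidence (< 60%), which I would first need to detect, I want to move to the next classifier (say, KNearestNeighborClassifier) and check its confidence on those same instances. If that classifier's confidence is high (> 60%), I accept its prediction. If its confidence is also low (< 60%), I move on to the third classifier and do the same.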

Finally, if the third classifier's confidence is also low (< 60%), I want to accept the prediction of whichever of the three classifiers has the highest confidence.
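The fallback rule described above can be sketched on top of predict_proba outputs. This is only a minimal illustration under stated assumptions: the function name cascade_predict, the toy probability arrays, and the choice to treat exactly 60% as "confident" are all hypothetical and not part of the original post. With fitted scikit-learn classifiers, each array would come from calling predict_proba on X_test_prepared, listed in priority order.

```python
import numpy as np

def cascade_predict(probas, classes, threshold=0.6):
    """Pick, per instance, the first classifier whose top probability
    reaches `threshold`; if none does, use the most confident one.

    probas:  list of (n_samples, n_classes) arrays, in priority order
    classes: class labels in the column order of the probability arrays
    """
    probas = [np.asarray(p, dtype=float) for p in probas]
    n_samples = probas[0].shape[0]
    chosen = np.full(n_samples, -1)  # index of the classifier used per instance

    # Pass 1: walk the classifiers in priority order and claim every
    # still-undecided instance on which this classifier is confident.
    for k, p in enumerate(probas):
        confident = p.max(axis=1) >= threshold
        chosen[confident & (chosen == -1)] = k

    # Pass 2: for instances no classifier was confident about,
    # fall back to whichever classifier has the highest confidence.
    conf = np.stack([p.max(axis=1) for p in probas])  # (n_classifiers, n_samples)
    undecided = chosen == -1
    chosen[undecided] = conf.argmax(axis=0)[undecided]

    # Gather the probability row of the chosen classifier for each instance.
    picked = np.stack(probas)[chosen, np.arange(n_samples)]
    return np.asarray(classes)[picked.argmax(axis=1)], chosen

# Toy probabilities standing in for predict_proba output (made-up values)
rf = [[0.90, 0.10], [0.55, 0.45], [0.52, 0.48]]
knn = [[0.30, 0.70], [0.20, 0.80], [0.45, 0.55]]
svm = [[0.50, 0.50], [0.60, 0.40], [0.42, 0.58]]

preds, used = cascade_predict([rf, knn, svm], classes=[0, 1])
print(preds, used)
```

In the toy data, the first instance is settled by the first classifier, the second falls through to the second classifier, and the third is below 60% everywhere, so the most confident classifier (the third) decides it.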

Since I am new to machine learning, some of my statements may be confusing. I apologize for that, and please correct me where I am wrong.

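EDIT: X_test and y_test are shown below. I need to predict on X_test_prepared and evaluate the predictions against y_test_sl using f1_score. The predicted y must go through all three classifiers and have the best confidence for every instance.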

>> X_test_prepared
array([[ 0.69961751, -0.11156033, -0.43852312, ..., -0.40967982,
         0.32099948,  0.32099952],
       [ 0.90256086, -0.54532856, -0.46399801, ..., -0.05752097,
        -0.54261829, -0.54261947],
       [ 1.67447042,  0.24530384, -1.0113221 , ..., -0.54844942,
        -0.26066608, -0.26066032],
       ...,
       [ 0.28104683,  1.52670909,  0.62653301, ..., -1.15596295,
         2.05859487,  2.05859247],
       [ 1.50595496,  0.84507934, -0.44109634, ..., -0.71277072,
         0.14474518,  0.14474398],
       [-1.63423112, -0.12690448,  0.48577783, ..., -0.36025459,
         0.29137477,  0.29137047]])

>> y_test_sl
1321    0
1433    0
1859    0
1496    0
492     0
736     0
996     0
1001    0
634     0
1486    0
910     0
1579    0
373     0
1750    0
1563    0
1584    0
51      1
349     0
1162    1
594     0
1121    0
1637    0
1116    0
106     1
1533    0
993     0
960     0
277     0
142     1
1010    0
       ..
1104    1
1404    0
1646    0
1009    0
61      1
444     0
10      1
704     0
744     0
418     0
998     0
740     0
465     0
97      1
1550    1
1738    0
978     0
690     0
1071    0
1228    1
1539    0
145     1
1015    0
1371    0
1758    0
315     0
71      1
1090    0
1766    0
33      1
Name: Type of Formation_shaly limestone, Length: 515, dtype: uint8

1 Answer

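The goal here is to create an ensemble of classifiers and, for each instance, take the most "confident" prediction (the class with the highest probability) across all of them. The code is as follows: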

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_features=4) # Put your training data here instead

# parameters for random forest
rfclf_params = {
    'bootstrap': True, 
    'class_weight':None, 
    'criterion':'entropy',
    'max_depth':None, 
    'max_features':'sqrt',  # 'auto' meant 'sqrt' for classifiers; it was removed in scikit-learn 1.3
    # ... fill in the rest you want here
}

# Fill in svm params here
svm_params = {
    'probability':True
}

# KNeighbors params go here
kneighbors_params = {

}

params = [rfclf_params, svm_params, kneighbors_params]
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]

def ensemble(classifiers, params, X_train, y_train, X_test):
    classes = np.unique(y_train)
    # One column per class, in the same sorted order that predict_proba uses
    best_preds = np.zeros((len(X_test), len(classes)))

    for i in range(len(classifiers)):
        # Construct the classifier by unpacking params 
        # store classifier instance
        clf = classifiers[i](**params[i])
        # Fit the classifier as usual and call predict_proba
        clf.fit(X_train, y_train)
        y_preds = clf.predict_proba(X_test)
        # Take maximum probability for each class on each classifier 
        # This is done for every instance in X_test
        # see the docs of np.maximum here: 
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds = np.maximum(best_preds, y_preds)

    # map the maximum probability for each instance back to its corresponding class
    preds = np.array([classes[np.argmax(pred)] for pred in best_preds])
    return preds

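# Check the predictions (scored on the training data here;
# pass your test set as X_test and score against y_test for a real evaluation)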
from sklearn.metrics import accuracy_score, f1_score
y_preds = ensemble(classifiers, params, X_train, y_train, X_train)
print(accuracy_score(y_train, y_preds), f1_score(y_train, y_preds))

If you want the algorithm to return the highest probability instead of the predicted class, have ensemble return [np.amax(pred_probs) for pred_probs in best_preds] instead of preds.
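To see how the element-wise maximum combines classifiers, here is a small sketch with made-up probability arrays standing in for the predict_proba output of two classifiers. The names proba_a and proba_b are hypothetical. It shows how both the predicted class and its confidence can be read off the merged array.

```python
import numpy as np

# Made-up predict_proba outputs from two classifiers for three instances
proba_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
proba_b = np.array([[0.70, 0.30], [0.20, 0.80], [0.35, 0.65]])

# Element-wise maximum: for each instance and class, keep the highest
# probability that any classifier assigned to that class.
best = np.maximum(proba_a, proba_b)

# Column index of the winning class and the confidence behind it
pred_idx = best.argmax(axis=1)
confidence = best.max(axis=1)

print(best)
print(pred_idx, confidence)
```

Note that the rows of the merged array no longer sum to 1; they are only used to find which class received the single highest probability. In the third instance, the second classifier's 0.65 for class 1 outweighs the first classifier's 0.55 for class 0, so class 1 is chosen.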

Answered 2024-05-09T12:21:10+08:00
