
Evaluating multiple scores with sklearn cross_val_score


I am trying to evaluate multiple machine learning algorithms with sklearn for a couple of metrics (accuracy, recall, precision, and maybe more).

From what I understood from the documentation here and from the source code (I'm using sklearn 0.17), the cross_val_score function only receives one scorer per execution. So, in order to calculate multiple scores, I would have to:

  • Execute it multiple times

  • Implement my own (time-consuming and error-prone) scorer

I executed it multiple times with this code:

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
import time
from sklearn.datasets import load_iris

iris = load_iris()

models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
names = ["Naive Bayes", "Decision Tree", "SVM"]
for model, name in zip(models, names):
    print name
    start = time.time()
    for score in ["accuracy", "precision", "recall"]:
        print score,
        print " : ",
        print cross_val_score(model, iris.data, iris.target, scoring=score, cv=10).mean()
    print time.time() - start

I get this output:

Naive Bayes
accuracy  :  0.953333333333
precision  :  0.962698412698
recall  :  0.953333333333
0.0383198261261
Decision Tree
accuracy  :  0.953333333333
precision  :  0.958888888889
recall  :  0.953333333333
0.0494720935822
SVM
accuracy  :  0.98
precision  :  0.983333333333
recall  :  0.98
0.063080072403

This is OK, but it is slow on my own data. How can I measure all the scores at once?

2 Answers

  • 20

    I ran into the same problem and I created a module that can support multiple metrics in cross_val_score.
    In order to accomplish what you want with this module, you can write:

    from multiscorer import MultiScorer
    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.model_selection import cross_val_score
    import time

    # models, names and iris are defined as in the question above.
    # A single scorer object collects several metrics on every CV fold.
    scorer = MultiScorer({
        'Accuracy' : (accuracy_score, {}),
        'Precision' : (precision_score, {'pos_label': 3, 'average': 'macro'}),
        'Recall' : (recall_score, {'pos_label': 3, 'average': 'macro'})
    })

    for model, name in zip(models, names):
        print name
        start = time.time()

        cross_val_score(model, iris.data, iris.target, scoring=scorer, cv=10)
        results = scorer.get_results()

        for metric_name in results.keys():
            average_score = np.average(results[metric_name])
            print('%s : %f' % (metric_name, average_score))

        print 'time', time.time() - start, '\n\n'
    

    You can check out and download this module from GitHub. Hope it helps.
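
    For reference, the core trick behind such a multi-metric scorer is a callable with the (estimator, X, y) scorer signature that records the individual metrics as a side effect and returns a dummy number to cross_val_score. A minimal sketch of that idea (hypothetical class name; the actual multiscorer module on GitHub may be implemented differently):

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    class SimpleMultiScorer(object):
        """Hypothetical minimal version of a multi-metric scorer."""

        def __init__(self, metrics):
            # metrics: dict mapping a name to (metric_function, kwargs)
            self._metrics = metrics
            self._results = {name: [] for name in metrics}

        def __call__(self, estimator, X, y):
            # Called once per CV fold by cross_val_score.
            y_pred = estimator.predict(X)
            for name, (func, kwargs) in self._metrics.items():
                self._results[name].append(func(y, y_pred, **kwargs))
            # cross_val_score needs a single number; the real scores are
            # collected as a side effect and read back with get_results().
            return 0.0

        def get_results(self):
            return self._results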

  • 11

    Since the time of writing this post, scikit-learn has been updated and my original answer is obsolete; see the much cleaner solution below.


    You can write your own scoring function to capture all three pieces of information, but a scoring function for cross-validation must return a single number in scikit-learn (this is probably for compatibility reasons). Below is an example in which each score for each cross-validation slice is printed to the console, and the returned value is just the sum of the three metrics. If you want to return all of these values, you would have to make some changes to cross_val_score (line 1351 of cross_validation.py) and _score (line 1601 of the same file).

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cross_validation import cross_val_score
    import time
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    iris = load_iris()

    models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
    names = ["Naive Bayes", "Decision Tree", "SVM"]

    def getScores(estimator, x, y):
        # Compute all three metrics on one CV slice.
        # Note: pos_label is ignored when average='macro'.
        yPred = estimator.predict(x)
        return (accuracy_score(y, yPred),
                precision_score(y, yPred, pos_label=3, average='macro'),
                recall_score(y, yPred, pos_label=3, average='macro'))

    def my_scorer(estimator, x, y):
        # Print the individual scores and return a single number,
        # as required by cross_val_score.
        a, p, r = getScores(estimator, x, y)
        print a, p, r
        return a + p + r

    for model, name in zip(models, names):
        print name
        start = time.time()
        m = cross_val_score(model, iris.data, iris.target, scoring=my_scorer, cv=10).mean()
        print '\nSum:', m, '\n\n'
        print 'time', time.time() - start, '\n\n'
    

    This gives:

    Naive Bayes
    0.933333333333 0.944444444444 0.933333333333
    0.933333333333 0.944444444444 0.933333333333
    1.0 1.0 1.0
    0.933333333333 0.944444444444 0.933333333333
    0.933333333333 0.944444444444 0.933333333333
    0.933333333333 0.944444444444 0.933333333333
    0.866666666667 0.904761904762 0.866666666667
    1.0 1.0 1.0
    1.0 1.0 1.0
    1.0 1.0 1.0
    
    Sum: 2.86936507937 
    
    
    time 0.0249638557434 
    
    
    Decision Tree
    1.0 1.0 1.0
    0.933333333333 0.944444444444 0.933333333333
    1.0 1.0 1.0
    0.933333333333 0.944444444444 0.933333333333
    0.933333333333 0.944444444444 0.933333333333
    0.866666666667 0.866666666667 0.866666666667
    0.933333333333 0.944444444444 0.933333333333
    0.933333333333 0.944444444444 0.933333333333
    1.0 1.0 1.0
    1.0 1.0 1.0
    
    Sum: 2.86555555556 
    
    
    time 0.0237860679626 
    
    
    SVM
    1.0 1.0 1.0
    0.933333333333 0.944444444444 0.933333333333
    1.0 1.0 1.0
    1.0 1.0 1.0
    1.0 1.0 1.0
    0.933333333333 0.944444444444 0.933333333333
    0.933333333333 0.944444444444 0.933333333333
    1.0 1.0 1.0
    1.0 1.0 1.0
    1.0 1.0 1.0
    
    Sum: 2.94333333333 
    
    
    time 0.043044090271
    

    Starting with scikit-learn 0.19.0, the solution becomes much easier:

    from sklearn.model_selection import cross_validate
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    iris = load_iris()
    clf = SVC()
    # The dict keys only name the result columns; the string values select
    # the metrics (note that the 'rec_micro' key actually maps to the
    # macro-averaged recall here).
    scoring = {'acc': 'accuracy',
               'prec_macro': 'precision_macro',
               'rec_micro': 'recall_macro'}
    scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                            cv=5, return_train_score=True)
    print(scores.keys())
    print(scores['test_acc'])
    

    This gives:

    ['test_acc', 'score_time', 'train_acc', 'fit_time', 'test_rec_micro', 'train_rec_micro', 'train_prec_macro', 'test_prec_macro']
    [ 0.96666667  1.          0.96666667  0.96666667  1.        ]
    
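    If you need metric arguments that the predefined scoring strings do not cover (for example the average or pos_label settings used in the snippets above), you can pass callables built with make_scorer instead of strings. A small sketch under the same iris/SVC setup (the key names 'prec_macro' and 'rec_macro' are just illustrative):

    from sklearn.model_selection import cross_validate
    from sklearn.metrics import make_scorer, precision_score, recall_score
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    import numpy as np

    iris = load_iris()
    clf = SVC()

    # Mix predefined scoring strings with scorers that carry explicit
    # keyword arguments.
    scoring = {'acc': 'accuracy',
               'prec_macro': make_scorer(precision_score, average='macro'),
               'rec_macro': make_scorer(recall_score, average='macro')}
    scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5)

    # Each entry is an array with one value per fold; average if needed.
    for key in sorted(scores):
        print('%s: %.3f' % (key, np.mean(scores[key])))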
