
Pickling a trained classifier yields different results from those obtained directly from a freshly but identically trained classifier


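I am trying to pickle a trained SVM classifier from the scikit-learn library so that I don't have to train it over and over again. But when I pass the test data to the classifier loaded from the pickle, I get unusually high values for accuracy, F-measure, etc. If the test data is passed directly to the classifier that has not been pickled, it gives much lower values. I don't understand why pickling and unpickling the classifier object changes the way it behaves. Can someone help me with this?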

I am doing something like this:

from sklearn.externals import joblib
joblib.dump(grid, 'grid_trained.pkl')

Here, grid is the trained classifier object. When I unpickle it, it behaves very differently from when it is used directly.

1 Answer

  • -1

    As @AndreasMueller said, there should not be any difference. Here is a modified version of the example from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newgroups-dataset that uses pickle:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    
    # Set labels and data
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
    
    # Vectorize data
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(twenty_train.data)
    
    # TF-IDF transformation
    tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
    X_train_tf = tf_transformer.transform(X_train_counts)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    
    # Train classifier
    clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
    
    # Tag new data
    docs_new = ['God is love', 'OpenGL on the GPU is fast']
    X_new_counts = count_vect.transform(docs_new)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    predicted = clf.predict(X_new_tfidf)
    
    answers = [(doc, twenty_train.target_names[category]) for doc, category in zip(docs_new, predicted)]
    
    
    # Pickle the classifier
    import pickle
    with open('clf.pk', 'wb') as fout:
        pickle.dump(clf, fout)
    
    # Let's clear the classifier
    clf = None
    
    with open('clf.pk', 'rb') as fin:
        clf = pickle.load(fin)
    
    # Retag new data
    docs_new = ['God is love', 'OpenGL on the GPU is fast']
    X_new_counts = count_vect.transform(docs_new)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    predicted = clf.predict(X_new_tfidf)
    
    answers_from_loaded_clf = [(doc, twenty_train.target_names[category]) for doc, category in zip(docs_new, predicted)]
    
    assert answers_from_loaded_clf == answers
    print("Answers from freshly trained classifier and loaded pre-trained classifier are the same !!!")
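
The example above reloads only the classifier and reuses the in-memory `count_vect` and `tfidf_transformer`. In a new process those fitted objects would also have to be restored. One possible approach, sketched here under assumptions, is to bundle all three steps in a `Pipeline` and pickle that single object. The tiny corpus, the labels, and the names `text_clf` and `restored_clf` are made up for illustration and are not part of the original example.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (an assumption for illustration, not the 20 newsgroups data)
train_docs = [
    "God is love and faith",
    "prayer and faith in church",
    "OpenGL rendering on the GPU",
    "fast graphics card and GPU shaders",
]
train_labels = [0, 0, 1, 1]  # 0 = religion, 1 = graphics

# Bundle vectorizer, TF-IDF transformer and classifier into one estimator
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
]).fit(train_docs, train_labels)

docs_new = ["God is love", "OpenGL on the GPU is fast"]
predicted_before = text_clf.predict(docs_new)

# Round-trip the whole pipeline through pickle (in memory, no file needed)
restored_clf = pickle.loads(pickle.dumps(text_clf))
predicted_after = restored_clf.predict(docs_new)

# The restored pipeline carries its own fitted vocabulary and IDF weights,
# so it accepts raw text and should reproduce the earlier predictions
assert (predicted_before == predicted_after).all()
```

With this layout, the loading side never has to refit a vectorizer, which removes one way for the training-time and load-time feature spaces to drift apart.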
    

    The same holds when using joblib:

    # Pickle the classifier
    # joblib is a standalone package; `sklearn.externals.joblib` was removed in scikit-learn 0.23
    import joblib
    joblib.dump(clf, 'clf.pk')
    
    # Let's clear the classifier
    clf = None
    
    # Loads the pretrained classifier
    clf = joblib.load('clf.pk')
    
    # Retag new data
    docs_new = ['God is love', 'OpenGL on the GPU is fast']
    X_new_counts = count_vect.transform(docs_new)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    predicted = clf.predict(X_new_tfidf)
    
    answers_from_loaded_clf = [(doc, twenty_train.target_names[category]) for doc, category in zip(docs_new, predicted)]
    
    assert answers_from_loaded_clf == answers
    print("Answers from freshly trained classifier and loaded pre-trained classifier are the same !!!")
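
For the `grid` object in the question, a similar check can be sketched by scoring the in-memory estimator and its unpickled copy on the same held-out rows. This is a minimal sketch that assumes `grid` is a `GridSearchCV` wrapping an `SVC`. The question does not show how `grid` was built, so the synthetic data and the parameter grid below are hypothetical.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for the asker's dataset (an assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical parameter grid; the question does not show the real one
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3).fit(X_train, y_train)

# Round-trip the fitted search object through pickle in memory
restored = pickle.loads(pickle.dumps(grid))

# Score both objects on the *same* held-out rows
pred_fresh = grid.predict(X_test)
pred_restored = restored.predict(X_test)

assert np.array_equal(pred_fresh, pred_restored)
assert accuracy_score(y_test, pred_fresh) == accuracy_score(y_test, pred_restored)
```

If a check like this passes while the real scores still differ, the cause is more likely to lie outside serialization, for example in which rows are used as test data or in how features are prepared after loading.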
    
