Scikit-Learn Random Forest Classifier: high training and test accuracy, but not in production

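I am training a classifier to route text-based requests to the right department. I have ~107,000 labeled examples across 22 imbalanced classes, with roughly the following distribution: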

Class 1: 10,000
Class 2: 60,000
Class 3: 7,000
Class 4: 5,000
Class 5: 3,500
Classes 6 and 7: 2,000 samples each
Classes 7-15: 1,500 samples each
Classes 16-22: 500 samples each

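I have been preprocessing the data so that every class has the same number of samples (I have tried anywhere from 5,000 to 50,000 samples per class). With the classifier below and the balanced training data, I can get up to 98.5% accuracy on the test data, with the full data set split 50-50 between training and test. But when new requests come in and the saved classifier is loaded, it is only 50-70% accurate at best. The requests are relatively stable, and the same request always goes to the same department, so I am very surprised that it is only 50-70% accurate, especially given such high accuracy on the test data: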

import logging
import os
from collections import Counter  # used by up_sample

import joblib  # sklearn.externals.joblib was removed from scikit-learn
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.utils import resample  # used by up_sample

logger = logging.getLogger(__name__)

def up_sample(data, labels, **kwargs):
    label_counts = Counter(labels)
    max_label = max(label_counts, key=label_counts.get)
    max_label_count = kwargs.get('samples', label_counts[max_label])
    output_text = []
    output_labels = []
    for label, count in label_counts.items():
        label_text = [data_row for data_row, label_row in zip(data, labels) if label_row == label]
        resampled_labels = [label] * max_label_count
        resampled_text = resample(label_text, n_samples=max_label_count, random_state=0)
        output_text = output_text + resampled_text
        output_labels = output_labels + resampled_labels
    return output_text, output_labels


clf = Pipeline(
    steps=[
        ('tfidf_vectorizer', TfidfVectorizer(stop_words='english')),
        ('clf', RandomForestClassifier(n_estimators=250, n_jobs=-1)),
    ]
)

data, labels = up_sample(data, labels)  # UPDATE: produces ~700,000 samples, with many duplicates

label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.5, random_state=0) # UPDATE: many duplicates in both training and test data sets as a result of upsampling

clf.fit(X_train, y_train)

test_score = clf.score(X_test, y_test)
logger.debug('Test Score: %s', test_score) # 0.98-0.99%

cross_validation_results = cross_val_score(clf, data, labels)
logger.debug('Cross Validation results: %r', cross_validation_results) # [98.7, 99.1, 97.8]

y_test_predicted = clf.predict(X_test)
output_classification_report = classification_report(y_test, y_test_predicted, target_names=label_encoder.classes_)
logger.debug(output_classification_report)  # 0.95-1.0 for precision and recall for all classes

clf_file_name = os.path.join(directory, clf_name)
joblib.dump(clf, clf_file_name)

label_encoder_file_name = os.path.join(directory, label_encoder_name)
joblib.dump(label_encoder, label_encoder_file_name)

# Later, in a different script
clf_file_name = os.path.join(directory, name)
clf = joblib.load(clf_file_name)

label_encoder_file_name = os.path.join(directory, name)
label_encoder = joblib.load(label_encoder_file_name)

predictions = clf.predict(new_data)
# Pipeline.score expects (X, y); new_labels must be encoded with the same label_encoder
logger.debug(clf.score(new_data, new_labels)) # 50-70%
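
The UPDATE comments above note that up-sampling happens before the split, so copies of the same text land in both the training and the test set. A minimal sketch of the alternative ordering, split first and up-sample only the training half, is shown here. The helper name `up_sample_train_only` and the toy texts are made up for illustration and are not part of the original code.

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def up_sample_train_only(texts, labels, test_size=0.5, random_state=0):
    """Split first, then up-sample only the training portion (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    counts = Counter(y_train)
    target = max(counts.values())  # size of the largest training class
    X_bal, y_bal = [], []
    for label in counts:
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        # Sampling with replacement duplicates rows, but only inside the training set.
        X_bal += list(resample(rows, n_samples=target, random_state=random_state))
        y_bal += [label] * target
    return X_bal, X_test, y_bal, y_test


# Toy, made-up data: every text is unique before resampling.
texts = [f"request {i}" for i in range(40)]
labels = ["a"] * 30 + ["b"] * 10

X_train, X_test, y_train, y_test = up_sample_train_only(texts, labels)
overlap = set(X_train) & set(X_test)  # texts present on both sides of the split
train_counts = Counter(y_train)
```

With this ordering the test set keeps the original class distribution and contains no text that the model was fitted on, so its score should be closer to what new requests would see. Whether this accounts for the whole gap in the question is not something the sketch can confirm.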

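Also, when I retrain the classifier on new_data and then predict on new_data, it is 100% accurate. I understand that it scores higher because it has already seen those examples, but I have been reading about out-of-bag (OOB) error in random forests and suspect it may be related to my problem. I am not familiar enough with OOB to know how to correct this, and I don't know where to go from here. How can I solve this?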
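
For reference, a minimal sketch of how the out-of-bag estimate can be read from a `RandomForestClassifier`, assuming synthetic stand-in data from `make_classification` rather than the request data. Setting `oob_score=True` makes the forest score each training row using only the trees that did not see that row in their bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration, not the real requests).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# oob_score=True asks the forest to keep an out-of-bag accuracy estimate.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

train_accuracy = forest.score(X, y)  # scored on rows the trees were fitted on
oob_accuracy = forest.oob_score_     # scored on rows each tree did not see
```

In the pipeline above, the same attribute would be reached through `clf.named_steps['clf'].oob_score_` after fitting with `oob_score=True`. Note that if the training data contains up-sampled duplicates, a row that is out-of-bag for a tree may still have an identical copy in that tree's bootstrap sample, so the OOB estimate is likely to be optimistic in that setting as well.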

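Before posting my own question, I read the following questions/resources to try to solve my problem, but if I have overlooked something in them, please feel free to let me know: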
