Why is xgboost so slow when you have a large number of classes?

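I have a sparse dataset of dimensions (40000, 21). I am trying to build a classification model for it using xgboost. Unfortunately it is so slow that it never terminates for me. However, on the same dataset, scikit-learn's RandomForestClassifier takes about 1 second. This is the code I am using: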

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
[...]
t0 = time()
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(trainX, trainY)
print("RF score", rf.score(testX, testY))
print("Time to fit and score random forest", time()-t0)

t0 = time()
clf = XGBClassifier(n_jobs=-1)
clf.fit(trainX, trainY, verbose=True)
print(clf.score(testX, testY))
print("Time taken to fit and score xgboost", time()-t0)

To show the type of trainX:

print(repr(trainX))    
<40000x21 sparse matrix of type '<class 'numpy.int64'>'
    with 360000 stored elements in Compressed Sparse Row format>

Note that I am using all the default parameters except for n_jobs.

What am I doing wrong?

In [3]: print(xgboost.__version__)
0.6
print(sklearn.__version__)
0.19.1

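So far I have tried the following, based on suggestions in the comments:

I set n_estimators = 5. Now at least it finishes, in 62 seconds. This is still about 60 times slower than RandomForestClassifier.
With n_estimators = 5, I removed n_jobs=-1 and set n_jobs=1. It then finishes in about 107 seconds (about 100 times slower than RandomForestClassifier). If I increase n_jobs to 4, this speeds up to 27 seconds, still about 27 times slower than RandomForestClassifier.
If I leave the default number of estimators, it still never finishes.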

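Here is complete code to reproduce the problem using fake data. I set n_estimators=50 for both classifiers, which slows RandomForestClassifier down to about 16 seconds. Xgboost, on the other hand, still never terminates for me.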

#!/usr/bin/python3

from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from time import time

(trainX, trainY) = make_classification(n_informative=10, n_redundant=0, n_samples=50000, n_classes=120)

print("Shape of trainX and trainY", trainX.shape, trainY.shape)
t0 = time()
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
rf.fit(trainX, trainY)
print("Time elapsed by RandomForestClassifier is: ", time()-t0)
t0 = time()
xgbrf = XGBClassifier(n_estimators=50, n_jobs=-1,verbose=True)
xgbrf.fit(trainX, trainY)
print("Time elapsed by XGBClassifier is: ", time()-t0)

1 Answer


It turns out that xgboost's running time scales in proportion to the number of classes. See https://github.com/dmlc/xgboost/issues/2926.
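A rough way to see why the class count matters: as far as I understand it, xgboost's multi-class objective grows one tree per class in every boosting round, whereas each random forest tree handles all classes at once. The sketch below is only back-of-the-envelope arithmetic under that assumption. `boosted_tree_count` and `forest_tree_count` are hypothetical helper names, not part of either library, and the tree count is only a proxy for the actual running time.

```python
# Back-of-the-envelope sketch, not a benchmark.
# Assumption: a multi-class gradient booster grows one regression tree per
# class in every boosting round, while a random forest grows one
# multi-class tree per estimator.


def boosted_tree_count(n_rounds, n_classes):
    """Trees grown by a one-tree-per-class-per-round booster (hypothetical helper)."""
    # A binary problem needs only a single tree per round.
    trees_per_round = n_classes if n_classes > 2 else 1
    return n_rounds * trees_per_round


def forest_tree_count(n_estimators):
    """Trees grown by a random forest (hypothetical helper)."""
    return n_estimators


N_CLASSES = 120  # taken from the reproduction script in the question

for n_rounds in (5, 50, 100):
    print(
        "rounds/estimators:", n_rounds,
        "boosted trees:", boosted_tree_count(n_rounds, N_CLASSES),
        "forest trees:", forest_tree_count(n_rounds),
    )
```

With 120 classes and n_estimators=50, this puts the booster at 6000 trees against 50 for the forest, which is in line with the slowdown reported in the question. Lowering n_estimators or reducing the number of classes shrinks that product directly.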

Answered 2024-05-06T12:02:57+08:00
