Implementing R's random forest feature importance score in scikit-learn

I am trying to implement R's feature importance scoring method for a random forest regression model in sklearn. According to R's documentation:

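The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).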
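
In symbols, the quoted description can be read roughly as follows. The notation is introduced here only for illustration and does not come from the R documentation: $T$ is the number of trees, $\mathrm{MSE}_t$ is the OOB error of tree $t$, and $\mathrm{MSE}_t^{(j)}$ is the same error after permuting predictor $j$ in that tree's OOB rows.

```latex
d_{t,j} = \mathrm{MSE}_t^{(j)} - \mathrm{MSE}_t ,
\qquad
\bar{d}_j = \frac{1}{T} \sum_{t=1}^{T} d_{t,j} ,
\qquad
s_j = \operatorname{sd}\bigl(d_{1,j}, \dots, d_{T,j}\bigr) ,
\qquad
I_j =
\begin{cases}
\bar{d}_j / s_j & \text{if } s_j > 0, \\
\bar{d}_j       & \text{if } s_j = 0.
\end{cases}
```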

So, if I understand correctly, I need to be able to permute each predictor variable (feature) over the OOB samples of each tree.

I know I can access each tree of a trained forest like this:

from sklearn.ensemble import RandomForestRegressor

numberTrees = 100
clf = RandomForestRegressor(n_estimators=numberTrees)
clf.fit(X, Y)
for tree in clf.estimators_:
    pass  # do something with each fitted tree

Is there any way to get the list of OOB samples for each tree? Perhaps I could derive the list of OOB samples from each tree's random_state?

1 Answer

  • 2

    Although R uses OOB samples, I have found that I get similar results in scikit-learn by using all of the training samples. I am doing the following:

    from collections import defaultdict

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # X_train (a pandas DataFrame), y_train, trees, num_features and leaf
    # are assumed to be defined elsewhere.

    # permute training data and score against its own model
    epoch = 3
    seeds = range(epoch)

    scores = defaultdict(list)  # {feature: relative drop in R^2}

    # repeat the process several times, then average the score for each feature
    for j in range(epoch):
        clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[j],
                                    max_features=num_features, min_samples_leaf=leaf)

        clf = clf.fit(X_train, y_train)
        acc = clf.score(X_train, y_train)  # R^2 on the training data

        print('Epoch', j)
        # for each feature, permute its values and check the resulting score
        for i, col in enumerate(X_train.columns):
            if i % 200 == 0:
                print("- feature %s of %s permuted" % (i, X_train.shape[1]))
            X_train_copy = X_train.copy()
            X_train_copy[col] = np.random.permutation(X_train[col])
            shuff_acc = clf.score(X_train_copy, y_train)
            scores[col].append((acc - shuff_acc) / acc)

    # get mean across epochs
    scores_mean = {k: np.mean(v) for k, v in scores.items()}

    # sort scores (best first)
    scores_sorted = pd.DataFrame.from_dict(scores_mean, orient='index').sort_values(0, ascending=False)
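
For comparison, here is a minimal sketch of the per-tree OOB procedure that the quoted R documentation describes. It is not the randomForest implementation, and it does not use RandomForestRegressor: to keep each tree's OOB rows known exactly, the bagging is done by hand around DecisionTreeRegressor. The function name oob_permutation_importance and its parameters are hypothetical, chosen for illustration, and the normalization follows the quoted wording (division by the standard deviation of the per-tree differences).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def oob_permutation_importance(X, y, n_trees=100, max_features=None, seed=0):
    """Per-tree OOB permutation importance (hypothetical helper, hand-rolled bagging)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    diffs = []  # one row per tree: increase in OOB MSE for each permuted feature

    for _ in range(n_trees):
        # Draw a bootstrap sample; rows that were never drawn are this tree's OOB rows.
        boot = rng.integers(0, n_samples, size=n_samples)
        oob = np.setdiff1d(np.arange(n_samples), boot)
        if oob.size == 0:
            continue  # extremely unlikely, but then there is nothing to score

        tree = DecisionTreeRegressor(
            max_features=max_features,
            random_state=int(rng.integers(2**31 - 1)),
        ).fit(X[boot], y[boot])

        X_oob, y_oob = X[oob], y[oob]
        base_mse = np.mean((tree.predict(X_oob) - y_oob) ** 2)

        row = np.empty(n_features)
        for j in range(n_features):
            # Permute one feature within the OOB rows and re-score the same tree.
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            perm_mse = np.mean((tree.predict(X_perm) - y_oob) ** 2)
            row[j] = perm_mse - base_mse
        diffs.append(row)

    diffs = np.asarray(diffs)
    mean = diffs.mean(axis=0)
    std = diffs.std(axis=0, ddof=1)
    # As in the quoted text, skip the division where the standard deviation is 0.
    return np.divide(mean, std, out=mean.copy(), where=std > 0)
```

Calling oob_permutation_importance(X_train, y_train) returns one score per column, in column order. Because every tree is scored only on rows it never saw during fitting, the scores are not inflated by overfitting in the way that scoring against the training data can be.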
    
