
Decision tree party package prediction error: levels do not match


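I am building a CART regression tree model in R with the party package, but when I try to apply the model to the test dataset I get an error saying that the levels do not match.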

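I have spent the past week reading forum threads but still cannot find the right way to solve the problem, so I am reposting the question here with a made-up example. Can someone help explain the error message and suggest a solution?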

My training dataset has about 1,000 records and the test dataset about 150. Neither dataset contains NA values or blank fields.

My CART model, built with ctree from the party package, is:

mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

Sample of data_train:

Rate  Bank  Product  Salary    
1.5    A     aaa     100000
0.6    B     abc      60000
3      C     bac      10000
2.1    D     cba      50000
1.1    E     cca      80000

Sample of data_test:

Rate  Bank  Product   Salary
2.0    A     cba       80000
0.5    D     cca      250000
0.8    E     cba      120000
2.1    C     abc       65000

levels(data_train$Bank) : A, B, C, D, E

levels(data_test$Bank): A,D,E,C

I tried to give both datasets the same levels with the following code:

> is.factor(data_test$Bank)

[1] TRUE
(Made sure Bank and Product are factors in both datasets)
> levels(data_test$Bank) <- union(levels(data_test$Bank), levels(data_train$Bank))

> levels(data_test$Product) <- union(levels(data_test$Product), levels(data_train$Product))

However, when I try to run the prediction on the test dataset, I get the following error:

> fit1<- predict(mytree,newdata=data_test)

Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data
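
A minimal base-R sketch of what the `union()` assignment above does to the level vectors. The two vectors are illustrative stand-ins for the `Bank` columns. It assumes that party's check requires the level vectors to match exactly, order included; that is an inference from the error message, not something confirmed against the package source.

```r
# Illustrative stand-ins for data_train$Bank and data_test$Bank.
train_bank <- factor(c("A", "B", "C", "D", "E"))
test_bank  <- factor(c("A", "D", "E", "C"))

# Same assignment as in the question: the union starts with the test levels,
# so existing values keep their labels and "B" is appended at the end.
levels(test_bank) <- union(levels(test_bank), levels(train_bank))

levels(test_bank)    # "A" "C" "D" "E" "B"
levels(train_bank)   # "A" "B" "C" "D" "E"

# The two factors now share the same set of levels, but not the same order.
same_set   <- setequal(levels(test_bank), levels(train_bank))   # TRUE
same_order <- identical(levels(test_bank), levels(train_bank))  # FALSE
```

If the assumption holds, the differing order alone would be enough to trigger the error even though every test value is a known level.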

I also tried the following, but it changed the values of that field in my test dataset:

levels(data_test$Bank) <- levels(data_train$Bank)

The data_test table was altered:

Rate  Bank(altered)  Bank (original)   
2.0    A              A      
0.5    B              D      
0.8    C              E      
2.1    D              C
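
The relabeling in the table above can be reproduced with base R alone. The sketch below uses illustrative vectors, with the test level order A, D, E, C taken from the `levels(data_test$Bank)` output quoted earlier. It shows that `levels<-` renames levels by position, while `factor()` with an explicit `levels` argument matches by value.

```r
# Illustrative vectors only; the level order A, D, E, C mirrors the
# levels(data_test$Bank) output quoted in the question.
train_levels <- c("A", "B", "C", "D", "E")
test_bank <- factor(c("A", "D", "E", "C"), levels = c("A", "D", "E", "C"))

# `levels<-` renames by position: 1st level -> "A", 2nd -> "B", and so on.
relabeled <- test_bank
levels(relabeled) <- train_levels
as.character(relabeled)   # "A" "B" "C" "D": the values changed

# factor() matches by value, so the underlying data stay intact.
rebuilt <- factor(as.character(test_bank), levels = train_levels)
as.character(rebuilt)     # "A" "D" "E" "C": the values are preserved
```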

1 Answer

  • 1

    Instead of assigning new levels to the existing factors, you can try rebuilding the factors with matching levels. Here is an example:

    # start the party
    library(party)
    
    # create training data sample
    # (stringsAsFactors = TRUE is needed on R >= 4.0.0, where character
    # columns are no longer converted to factors by default)
    data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                             Bank = c("A", "B", "C", "D", "E"),
                             Product = c("aaa", "abc", "bac", "cba", "cca"),
                             Salary = c(100000, 60000, 10000, 50000, 80000),
                             stringsAsFactors = TRUE)
    
    # create testing data sample
    data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                            Bank = c("A", "D", "E", "C"),
                            Product = c("cba", "cca", "cba", "abc"),
                            Salary = c(80000, 250000, 120000, 65000),
                            stringsAsFactors = TRUE)
    
    # get the union of levels between train and test for Bank and Product
    bank_levels <- union(levels(data_test$Bank), levels(data_train$Bank))
    product_levels <- union(levels(data_test$Product), levels(data_train$Product))
    
    # rebuild Bank with union of levels
    data_test$Bank <- with(data_test, factor(Bank, levels = bank_levels)) 
    data_train$Bank <- with(data_train, factor(Bank, levels = bank_levels)) 
    
    # rebuild Product with union of levels
    data_test$Product <- with(data_test, factor(Product, levels = product_levels)) 
    data_train$Product <- with(data_train, factor(Product, levels = product_levels)) 
    
    # fit the model
    mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)
    
    # generate predictions
    fit1 <- predict(mytree, newdata = data_test)
    
    > fit1
         Rate
    [1,] 1.66
    [2,] 1.66
    [3,] 1.66
    [4,] 1.66
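
If an already fitted model should not be refit, one possible variation is to rebuild only the test factors from the training levels. The sketch below uses base R only; `align_levels` is a hypothetical helper written for illustration and is not part of party. It assumes every test value already occurs in the training data. Values that do not would become `NA`, so the helper warns about them.

```r
# Hypothetical helper (not part of party): rebuild each factor column of
# `new` so that it carries exactly the levels of the same column in `ref`.
align_levels <- function(new, ref) {
  for (col in names(ref)) {
    if (is.factor(ref[[col]]) && col %in% names(new)) {
      values <- as.character(new[[col]])
      unseen <- setdiff(unique(values[!is.na(values)]), levels(ref[[col]]))
      if (length(unseen) > 0) {
        warning("Column ", col, " has values not seen in training: ",
                paste(unseen, collapse = ", "))
      }
      # factor() matches by value; unseen values turn into NA
      new[[col]] <- factor(values, levels = levels(ref[[col]]))
    }
  }
  new
}

# Small made-up data frames shaped like the ones in the question
train <- data.frame(Bank = c("A", "B", "C", "D", "E"),
                    Product = c("aaa", "abc", "bac", "cba", "cca"),
                    stringsAsFactors = TRUE)
test <- data.frame(Bank = c("A", "D", "E", "C"),
                   Product = c("cba", "cca", "cba", "abc"),
                   stringsAsFactors = TRUE)

aligned <- align_levels(test, train)
identical(levels(aligned$Bank), levels(train$Bank))   # TRUE
```

With this variation the training data and the fitted tree stay untouched, and the aligned data frame can then be passed as `newdata` to `predict()`.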
    
