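I need to optimize the accuracy of the C4.5 algorithm on my churn dataset using the RWeka implementation (J48()). To do so, I am using the train() function from the caret package to help me determine the best parameter settings (for M and C). I then tried to verify the result by running J48() manually with the parameters that train() had determined. The outcome was surprising: the manual run had a much better result.
This raises the following questions: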
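- Which parameters might differ when J48() is run manually?
- How can I get the train() function to deliver results that are similar to, or better than, the manual parameter settings?
- Or am I missing something completely?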
I am running the following code:
library("RWeka", lib.loc="~/R/win-library/3.3")
library("caret", lib.loc="~/R/win-library/3.3")
library("gmodels", lib.loc="~/R/win-library/3.3")
set.seed(7331)
Determining the best C4.5 model with J48 using train() from the caret package:
ctrl <- trainControl(method="LGOCV", p=0.8, seeds=NA)
grid <- expand.grid(.M=25*(1:15), .C=c(0.1,0.05,0.025,0.01,0.0075,0.005))
Training the model with the full dataset "response_nochar":
rtrain <- train(churn~.,data=response_nochar,method="J48",na.action=na.pass,trControl=ctrl,tuneGrid=grid)
This returns rtrain$finalModel with a prediction accuracy of 0.6055 (and a tree of size 3 with 2 leaves):
# Accuracy was used to select the optimal model using the largest value.
# The final values used for the model were C = 0.005 and M = 25.
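There are approx. 50 combinations with exactly the same accuracy of 0.6055, ranging from the values given for the final model up to (M = 325, C = 0.1), with one exception in between.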
Trying the parameter values manually with J48:
# splitting into training and test datasets, deriving from full dataset "response_nochar"
# similar/equal to the above splitting with LGOCV and p=0.8?
response_sample <- sample(10000, 8000)
response_train <- response_nochar[response_sample,]
response_test <- response_nochar[-response_sample,]
# setting parameters
jctrl <- Weka_control(M=25,C=0.005)
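As far as I understand it, LGOCV in caret repeats the 80/20 split several times and averages the hold-out accuracy, whereas the split above is drawn only once. Below is a minimal sketch of such a repeated hold-out in base R, so that a single split can be put into context. The helper `repeated_holdout()` and its arguments are hypothetical names of my own, not part of caret or RWeka, and the plain random split is only an approximation of what caret does internally. A toy majority-class "model" stands in for J48 so the sketch runs without RWeka.

```r
# Hypothetical helper: repeat a random train/test split and collect accuracies.
# fit_fun(train) must return a model; pred_fun(model, test) must return labels.
repeated_holdout <- function(data, target, fit_fun, pred_fun, p = 0.8, times = 25) {
  n <- nrow(data)
  acc <- numeric(times)
  for (i in seq_len(times)) {
    idx <- sample(n, floor(p * n))          # rows used for training in this repeat
    train <- data[idx, , drop = FALSE]
    test <- data[-idx, , drop = FALSE]
    model <- fit_fun(train)
    pred <- pred_fun(model, test)
    # share of test rows whose predicted label matches the actual label
    acc[i] <- mean(as.character(pred) == as.character(test[[target]]))
  }
  acc
}

# Toy stand-in data and model (illustrative only, not my churn data):
set.seed(1)
toy <- data.frame(x = rnorm(200), churn = factor(rep(c("no", "yes"), each = 100)))
majority_fit <- function(train) names(which.max(table(train$churn)))
majority_pred <- function(model, test) rep(model, nrow(test))

acc <- repeated_holdout(toy, "churn", majority_fit, majority_pred, times = 10)
summary(acc)   # spread of accuracy across the repeated splits
```

With the real data, the two functions would presumably be `function(d) J48(churn ~ ., data = d, na.action = na.pass, control = jctrl)` and `function(m, d) predict(m, newdata = d, na.action = na.pass)`, mirroring the calls used in this post.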
Computing the model:
c45 <- J48(churn~.,data=response_train,na.action=na.pass,control=jctrl)
Predicting with the test dataset:
pred_c45 <- predict(c45, newdata=response_test, na.action=na.pass)
The model has a prediction accuracy of 0.655 (with a tree of size 25 and 13 leaves).
CrossTable(response_test$churn, pred_c45, prop.chisq= FALSE, prop.c= FALSE, prop.r= FALSE, dnn= c('actual churn','predicted churn'))
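For reference, an accuracy figure like the one above can be read off the diagonal of the confusion table. Here is a small sketch in base R; the label vectors are made up purely for illustration and are not taken from my data.

```r
# Illustrative vectors only; with the real objects these would be
# response_test$churn and pred_c45.
actual    <- factor(c("yes", "no", "yes", "no", "yes", "no"), levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "no", "yes", "yes"), levels = c("no", "yes"))

conf_mat <- table(actual = actual, predicted = predicted)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)   # correctly classified / all rows
accuracy_check <- mean(actual == predicted)       # same quantity, computed directly
```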
PS: The dataset I am using contains 10000 records, and the distribution of the target variable is 50:50.