I'm very new to R and am trying to implement the random forest algorithm.

My training and test sets each have 60 features, in the following format:

Train: feature1,feature2 .. feature60,Label

Test: FileName,feature1,feature2 ... feature60

Training sample

mov-mov,or-or,push-push,or-mov,sub-sub,mov-or,sub-mov,xor-or,call-sub,mul-imul,mov-push,push-mov,push-call,or-jz,mov-mul,cmp-or,mov-sub,sub-or,or-sub,or-push,jnz-or,jmp-sub,or-in,mov-call,retn-sub,mul-mul,or-jmp,imul-mul,pop-pop,nop-nop,nop-mul,sub-push,imul-mov,test-or,mul-mov,lea-push,std-mov,in-call,or-call,mov-std,mov-cmp,std-mul,call-or,jz-mov,push-or,pop-retn,add-mov,mov-add,mov-xor,in-inc,mov-pop,in-or,in-push,push-lea,lea-mov,mov-lea,sub-add,std-std,sub-cmp,or-cmp,Label
687,1346,1390,1337,750,2770,1518,418,1523,0,441,532,612,512,0,411,354,310,412,495,134,236,318,237,226,0,0,0,200,0,0,386,39,365,0,0,0,125,528,0,125,0,41,260,169,143,149,61,89,0,127,126,107,44,45,40,79,0,273,157,9
812,873,83,533,88,484,264,106,199,0,188,137,128,51,38,92,131,102,52,58,37,26,428,95,107,0,34,0,58,0,0,39,0,26,0,27,0,152,152,0,45,0,124,0,0,73,84,88,22,23,59,319,105,56,86,47,0,0,43,41,2

Test sample

FileName,mov-mov,or-or,push-push,or-mov,sub-sub,mov-or,xor-or,sub-mov,call-sub,mul-imul,push-mov,mov-push,push-call,mov-mul,or-jz,cmp-or,mov-sub,sub-or,or-sub,or-push,jmp-sub,jnz-or,or-in,mul-mul,or-jmp,mov-call,retn-sub,imul-mul,nop-mul,pop-pop,nop-nop,imul-mov,sub-push,mul-mov,test-or,lea-push,std-mov,or-call,mov-std,in-call,std-mul,mov-cmp,call-or,push-or,jz-mov,pop-retn,in-or,add-mov,mov-add,in-inc,mov-xor,in-push,push-lea,mov-pop,lea-mov,mov-lea,mov-nop,or-cmp,sub-add,sub-cmp
Ig2DB5tSiEy1cJvV0zdw,166,360,291,194,41,201,62,61,41,18,85,56,121,18,15,0,57,131,113,123,0,9,54,0,0,18,15,0,0,15,0,8,25,0,0,11,0,70,0,43,0,0,63,37,0,14,51,43,56,36,26,0,20,14,17,14,0,9,18,0
k4HCwy5WRFXczJU6eQdT,3,88,106,23,104,0,12,43,59,0,65,87,99,0,2,2,47,22,4,53,1,5,0,0,0,0,46,0,0,0,0,0,4,0,0,6,0,44,0,21,0,0,0,0,0,0,0,2,1,1,3,0,1,2,9,2,0,0,44,2

This is what I have in R so far:

library(randomForest)
dat <- read.csv("train-sample.csv", sep = ",", header = TRUE)
test <- read.csv("test-sample.csv", sep = ",", header = TRUE)
attach(dat)


#If I do this, I get Error: unexpected 'in' ...
rfmodel = randomForest (Label ~ mov-mov + or-or + push-push + or-mov + sub-sub + mov-or + sub-mov + xor-or + call-sub + mul-imul + mov-push + push-mov + push-call + or-jz + mov-mul + cmp-or + mov-sub + sub-or + or-sub + or-push + jnz-or + jmp-sub + or-in + mov-call + retn-sub + mul-mul + or-jmp + imul-mul + pop-pop + nop-nop + nop-mul + sub-push + imul-mov + test-or + mul-mov + lea-push + std-mov + in-call + or-call + mov-std + mov-cmp + std-mul + call-or + jz-mov + push-or + pop-retn + add-mov + mov-add + mov-xor + in-inc + mov-pop + in-or + in-push + push-lea + lea-mov + mov-lea + sub-add + std-std + sub-cmp + or-cmp, data=dat);

#If I do this, I get Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars
rfmodel = randomForest (Label ~ 'mov-mov' + 'or-or' + 'push-push' + or-mov + sub-sub + mov-or + sub-mov + xor-or + call-sub + mul-imul + mov-push + push-mov + push-call + or-jz + mov-mul + cmp-or + mov-sub + sub-or + or-sub + or-push + jnz-or + jmp-sub + 'or-in' + mov-call + retn-sub + mul-mul + or-jmp + imul-mul + pop-pop + nop-nop + nop-mul + sub-push + imul-mov + test-or + mul-mov + lea-push + 'std-mov' + 'in-call' + 'or-call' + 'mov-std' + 'mov-cmp' + 'std-mul' + 'call-or' + 'jz-mov' + 'push-or' + 'pop-retn' + 'add-mov' + 'mov-add' + 'mov-xor' + 'in-inc' + 'mov-pop' + 'in-or' + 'in-push' + 'push-lea' + 'lea-mov' + 'mov-lea' + 'sub-add' + 'std-std' + 'sub-cmp' + 'or-cmp', data=dat);


#I even tried this and got Error in na.fail.default(list(Label = c(9L, 2L, 9L, 1L, 8L, 6L, 2L, 2L,  :   missing values in object
rfmodel <- randomForest(Label~., dat);
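One thing that may be relevant to the first two errors: in a formula, `mov-mov` is parsed as `mov` minus `mov`, and `in` is a reserved word, which would explain "unexpected 'in'". Single-quoted strings are not variable names either; backticks are the quoting that a formula accepts. On top of that, `read.csv` with its default `check.names = TRUE` renames headers such as `mov-mov` to `mov.mov`, so the hyphenated names may not exist in `dat` at all. The sketch below is base R only and uses a hypothetical two-feature excerpt of the training header (inlined as `csv_text`, not the real file) to show both behaviours.

```r
# Hypothetical cut-down excerpt of the training file, inlined so the sketch runs on its own.
csv_text <- "mov-mov,or-in,Label\n687,318,9\n812,428,2\n"

# Default check.names = TRUE: headers go through make.names(), so "-" becomes ".".
dat_checked <- read.csv(text = csv_text)
print(names(dat_checked))

# check.names = FALSE keeps the raw headers; they then need backticks inside a formula.
dat_raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(dat_raw))

# Build the formula programmatically instead of typing 60 terms by hand.
features <- setdiff(names(dat_raw), "Label")
f <- reformulate(sprintf("`%s`", features), response = "Label")
print(f)

# Count missing values per column; relevant to the na.fail error from Label ~ .
print(colSums(is.na(dat_checked)))
```

With the default renaming, `Label ~ .` avoids naming the columns at all, which leaves the missing-values error as a separate issue to check in the real data.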

So I'm a bit stuck. Ultimately I'd like to use something like this:

predicted <- predict(rfmodel, test, type="response");
prop.table(table(test$FileName, predicted),1);

to get output of the following form:

FileName,Label1,Label2,Label3 .. Label9

name1,0.98,0,0.02,0,0..0

(basically, each file name followed by the probability of each label)
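A minimal sketch of how that output shape might be produced, assuming the `randomForest` package is installed. The data frames `train_toy` and `test_toy` are hypothetical stand-ins with two features and two labels, not the real 60-feature files. The two points it illustrates are that `Label` has to be a factor for classification, and that `predict(..., type = "prob")` returns one column per class rather than a single predicted label.

```r
library(randomForest)

set.seed(1)  # forests are random; fix the seed so runs are repeatable

# Hypothetical toy data; column names follow what read.csv() would produce by default.
train_toy <- data.frame(
  mov.mov = c(687, 812, 650, 790, 700, 830),
  or.or   = c(1346, 873, 1300, 900, 1250, 860),
  Label   = c(9, 2, 9, 2, 9, 2)
)
test_toy <- data.frame(
  FileName = c("Ig2DB5tSiEy1cJvV0zdw", "k4HCwy5WRFXczJU6eQdT"),
  mov.mov  = c(166, 3),
  or.or    = c(360, 88)
)

# A numeric Label would make randomForest() fit a regression; a factor gives classification.
train_toy$Label <- factor(train_toy$Label)

rfmodel <- randomForest(Label ~ ., data = train_toy, ntree = 50)

# type = "prob": matrix with one row per test case and one column per class.
probs <- predict(rfmodel, newdata = test_toy, type = "prob")

# Attach the file names to get FileName, <class 1>, <class 2>, ... per row.
result <- cbind(FileName = test_toy$FileName, as.data.frame(unclass(probs)))
print(result)
```

On the real data, `write.csv(result, "predictions.csv", row.names = FALSE)` would write the table in the comma-separated form shown above; the output file name is only an example.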

Any help is appreciated. Thanks.