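As a beginner R user, despite reading (1) numerous SO posts on binning and grouping and (2) the documentation for the data.table and dplyr packages, I still cannot figure out how to apply the power of these packages to bin continuous and factor variables for further use in credit scoring modelling.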

Problem: To build a code-efficient, easily-customisable, more or less automated solution for binning variables with minimal hard-coding.

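These variables used to be binned with stored procedures (Oracle), but I would like to switch completely to R and work with the following data frames: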

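Bin the variables in "df_Raw" according to the variables, bin ranges and levels in "binsDF", and store the binned variables in "df_Binned". So far I have only managed to produce simple, straightforward code that is lengthy, error-prone (cut levels and labels are hard-coded), difficult to roll back, and just ugly; it works, though.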

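The goal is to automate this binning with as little hard-coding as possible, so that re-binning only means updating the bin ranges and levels in "binsDF" and re-running the code, rather than manually editing all the hard-coded values.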

I am wondering how the *ply family of functions and the dplyr functions could be nicely applied to this problem.

Data description: the dataset has 100 variables and 10,000-20,000 observations, with two types of variables to be binned:

  • Continuous variables. Example - OVERDUEAMOUNT - has zero values, NAs, and both negative and positive numeric values.

OVERDUEAMOUNT needs to be split into 7 bins: bin #1 contains only zeros, bins #2-6 contain the continuous values split into 5 custom-sized intervals, and bin #7 contains only NAs.

  • Factor variables, containing character and numeric values. Example - PROFESSION - has 4 levels: "NA" and 3 values/codes that stand for certain categories of professions/job types.

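It is important to keep zeros and NAs in 2 separate bins, because their interpretation usually differs greatly both from each other and from the other values.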
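To illustrate why zeros and NAs need explicit handling, here is a minimal sketch on a made-up vector `x` (an assumption for illustration, not the real data). On its own, `cut()` places zero in the first interval and leaves NAs as NA, so both have to be assigned their own bins in a separate step.

```r
# Toy values (made up for illustration): a zero, an NA, and ordinary amounts.
x <- c(0, NA, -250, 1500, 9000, 20000)

# The five custom intervals for bins 2-6, closed on the left: [min, max).
breaks <- c(-Inf, 1500, 4000, 8000, 12000, Inf)

# cut() alone: zero lands in the first interval, NA stays NA.
# labels = FALSE returns interval codes 1-5; adding 1 shifts them to bins 2-6.
raw_bin <- as.integer(cut(x, breaks = breaks, labels = FALSE, right = FALSE)) + 1L

# Override the two special cases so that each gets a bin of its own.
bin <- raw_bin
bin[!is.na(x) & x == 0] <- 1L
bin[is.na(x)] <- 7L

data.frame(x, raw_bin, bin)
```

Because the overrides are applied by position, the result stays in the original row order and no temporary data frames are needed.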

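Datasets such as iris or GermanCredit contain no NAs, strings, or zeros, so they are not applicable here; I have therefore written some code below to replicate my data. Many thanks in advance!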

Raw data to be binned.

library(dplyr) # provides n_distinct(), semi_join() and glimpse() used below.

OVERDUEAMOUNT_numbers <- rnorm(10000, mean = 9000, sd = 3000)
OVERDUEAMOUNT_zeros <- rep(0, 3000)
OVERDUEAMOUNT_NAs <- rep(NA, 4000)
OVERDUEAMOUNT <- c(OVERDUEAMOUNT_numbers, OVERDUEAMOUNT_zeros, OVERDUEAMOUNT_NAs)

PROFESSION_f1 <- rep("438", 3000) 
PROFESSION_f2 <- rep("000", 4000)
PROFESSION_f3 <- rep("selfemployed", 5000)
PROFESSION_f4 <- rep(NA, 5000)
PROFESSION <- c(PROFESSION_f1, PROFESSION_f2, PROFESSION_f3, PROFESSION_f4)

ID <- sample(123456789:987654321, 17000, replace = TRUE); n_distinct(ID)

df_Raw <- cbind.data.frame(ID, OVERDUEAMOUNT, PROFESSION) 
colnames(df_Raw) <- c("ID", "OVERDUEAMOUNT", "PROFESSION")

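Convert PROFESSION to a factor, to replicate how this variable is processed before being imported into R. Shuffle the data frame so that it looks like real data.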

df_Raw$PROFESSION <- as.factor(df_Raw$PROFESSION) 
df_Raw <- df_Raw[sample(nrow(df_Raw)), ]

Data frame with bins.

variable <- c(rep("OVERDUEAMOUNT", 7), rep("PROFESSION", 4))
min <- c(0, c(-Inf, 1500, 4000, 8000, 12000), "", c("438", "000", "selfemployed", ""))
max <- c(0, c(1500, 4000, 8000, 12000, Inf), "", c("438", "000", "selfemployed", ""))
bin <- c(c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4))

binsDF <- cbind.data.frame(variable, min, max, bin)
colnames(binsDF) <- c("variable", "min", "max", "bin")
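As a rough idea of what a table-driven version could look like, here is a base-R sketch. It assumes a bins table laid out like "binsDF" (min/max stored as character, an empty string marking the NA bin, and min equal to max marking a single-value bin such as zero). The helper `bin_variable()` and the toy data frames `df_raw_toy` and `bins_toy` are hypothetical names introduced only for this illustration.

```r
# Toy stand-ins for df_Raw and binsDF (values made up; same layout as above).
df_raw_toy <- data.frame(
  ID = 1:8,
  OVERDUEAMOUNT = c(0, NA, -50, 1500, 3999, 8000, 11999.5, 12000),
  PROFESSION = factor(c("438", "000", "selfemployed", NA, "438", NA, "000", "438"))
)
bins_toy <- data.frame(
  variable = c(rep("OVERDUEAMOUNT", 7), rep("PROFESSION", 4)),
  min = c("0", "-Inf", "1500", "4000", "8000", "12000", "",
          "438", "000", "selfemployed", ""),
  max = c("0", "1500", "4000", "8000", "12000", "Inf", "",
          "438", "000", "selfemployed", ""),
  bin = c(1:7, 1:4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: bin one column using only the rows of the bins table
# that describe it. Intervals are treated as [min, max).
bin_variable <- function(x, spec) {
  na_bin <- spec$bin[spec$min == ""]          # row with empty min/max = NA bin
  spec <- spec[spec$min != "", ]
  out <- rep(NA_integer_, length(x))
  if (is.numeric(x)) {
    lo <- as.numeric(spec$min)
    hi <- as.numeric(spec$max)
    point <- lo == hi                         # single-value bins such as zero
    for (i in which(!point)) out[!is.na(x) & x >= lo[i] & x < hi[i]] <- spec$bin[i]
    for (i in which(point))  out[!is.na(x) & x == lo[i]] <- spec$bin[i]
  } else {
    out <- spec$bin[match(as.character(x), spec$min)]  # level -> bin lookup
  }
  out[is.na(x)] <- na_bin
  as.integer(out)
}

# Apply the helper to every variable named in the bins table.
vars <- unique(bins_toy$variable)
binned <- lapply(vars, function(v) {
  bin_variable(df_raw_toy[[v]], bins_toy[bins_toy$variable == v, ])
})
df_binned_toy <- data.frame(ID = df_raw_toy$ID, setNames(binned, vars))
df_binned_toy
```

With this layout, re-binning would only require editing the rows of the bins table; the single-value bins are applied after the ranges so that zero overrides the interval it would otherwise fall into.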

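How I bin the variables: copy the list of IDs "as is" into a separate data frame, to be used later in the semi-join as a "reference/standard" for the original ID list.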

dfID <- as.data.frame(df_Raw$ID); colnames(dfID) <- c("ID")

Continuous variable - OVERDUEAMOUNT. Split the variable into 3 temporary data frames: zeros, NAs, and the numeric observations to be cut.

df_tmp_zeros <- subset(x=df_Raw, subset=(OVERDUEAMOUNT == 0), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_zeros)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_NAs)
df_tmp_numbers <- subset(x=df_Raw, subset=(!is.na(OVERDUEAMOUNT) & OVERDUEAMOUNT != 0), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_numbers)
(nrow(df_tmp_zeros) + nrow(df_tmp_NAs) + nrow(df_tmp_numbers)) == nrow(df_Raw) # double-check that all observations are split into 3 parts.

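Replace the zeros and NAs with the appropriate bin numbers. Specify the number of intervals and the interval ranges, and cut the numeric values into intervals. Combine the 3 parts. Append the binned variable to the final data frame.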

df_tmp_zeros$OVERDUEAMOUNT <- as.factor(1)
df_tmp_NAs$OVERDUEAMOUNT <- as.factor(7)
cuts.OVERDUEAMOUNT <- c(-Inf, 1500, 4000, 8000, 12000, Inf) 
labels.OVERDUEAMOUNT <- c(2:6) 
df_tmp_numbers$OVERDUEAMOUNT <- cut(df_tmp_numbers$OVERDUEAMOUNT, breaks = cuts.OVERDUEAMOUNT, labels = labels.OVERDUEAMOUNT, right = FALSE)

df_tmp_allback <- rbind(df_tmp_zeros, df_tmp_NAs, df_tmp_numbers)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back. 

df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
glimpse(df_semijoin); summary(df_semijoin)

df_Binned <- df_semijoin
str(df_Binned)
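For comparison, the split / cut / rbind / semi-join sequence above could probably be collapsed into a single `mutate()` call. The sketch below is only one possible approach, using `dplyr::case_when()` together with base `findInterval()` on a made-up data frame `df_toy`; the column name `OVERDUEAMOUNT_bin` is an assumption for illustration.

```r
library(dplyr)

# Toy stand-in for df_Raw (values made up for illustration).
df_toy <- data.frame(ID = 1:6,
                     OVERDUEAMOUNT = c(9000, 0, NA, -10, 1500, 12000))

# Lower edges of bins 2-6; findInterval() treats intervals as [lower, upper).
lower_edges <- c(-Inf, 1500, 4000, 8000, 12000)

df_toy_binned <- df_toy %>%
  mutate(OVERDUEAMOUNT_bin = case_when(
    is.na(OVERDUEAMOUNT) ~ 7L,                            # NAs get their own bin
    OVERDUEAMOUNT == 0   ~ 1L,                            # zeros get their own bin
    TRUE ~ findInterval(OVERDUEAMOUNT, lower_edges) + 1L  # bins 2-6
  ))

df_toy_binned
```

Since nothing is split off and re-bound, the rows keep their original order and each ID stays attached to its own value.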

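Factor variable - PROFESSION. Split the variable into several temporary data frames: one for the NAs and as many others as there are factor levels.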

df_tmp_f1 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "438"), select=c(ID, PROFESSION)); nrow(df_tmp_f1)
df_tmp_f2 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "000"), select=c(ID, PROFESSION)); nrow(df_tmp_f2)
df_tmp_f3 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "selfemployed"), select=c(ID, PROFESSION)); nrow(df_tmp_f3)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(PROFESSION)), select=c(ID, PROFESSION)); nrow(df_tmp_NAs)

df_tmp_f1$PROFESSION <- as.factor(1)
df_tmp_f2$PROFESSION <- as.factor(2)
df_tmp_f3$PROFESSION <- as.factor(3)
df_tmp_NAs$PROFESSION <- as.factor(4)

df_tmp_allback <- rbind(df_tmp_f1, df_tmp_f2, df_tmp_f3, df_tmp_NAs)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back. 

df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
str(df_semijoin); summary(df_semijoin)

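df_Binned <- left_join(df_Binned, df_semijoin, by = "ID") # join on ID instead of cbind(): the two data frames are in different row order. Assumes ID is unique.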
str(df_Binned)
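The factor case can be sketched in the same spirit with a lookup table and `match()`, which avoids one temporary data frame per level. The names `profession`, `lookup` and `na_bin` below are made up for the example; the cross-tabulation at the end is a quick way to confirm that every level went to the intended bin.

```r
# Toy factor with missing values (made up; same codes as PROFESSION above).
profession <- factor(c("438", NA, "selfemployed", "000", "438", NA))

# Lookup table: one row per level, plus the bin reserved for NA.
lookup <- data.frame(level = c("438", "000", "selfemployed"), bin = 1:3,
                     stringsAsFactors = FALSE)
na_bin <- 4L

# match() returns the lookup row for each observation, in the original row order.
profession_bin <- lookup$bin[match(as.character(profession), lookup$level)]
profession_bin[is.na(profession)] <- na_bin

# Cross-tabulate raw levels against bins to check the mapping.
table(profession, profession_bin, useNA = "ifany")
```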

And so on...

P.S. UPDATE: the following post gives the best solution to this problem: roll join with start/end window

These posts were also helpful: How to join (merge) data frames (inner, outer, left, right)? Why does X[Y] join of data.tables not allow a full outer join, or a left join?

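The idea is as follows. First, create a subset of the data frame with the raw data: 1 column with the unique IDs, 1 column with the raw data (the variable's values), and 1 column with the variable's name, repeated with rep() as many times as there are observations of the variable. Then create a subset of the data frame with the bins for that one variable only (as many rows as there are bins for that particular variable); in my case it has 4 columns: Variable, Min, Max, Bin. See the sample code below: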

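I also tried foverlaps() from the data.table package, but it cannot handle NAs; the NAs have to be handled separately, AFAIU. Another solution would be a rolling join, but I have not cracked that yet. Advice on rolling joins would be appreciated.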

library(data.table); library(dplyr); library(sqldf) # as.data.table(), arrange(), sqldf()

# Subset numeric variables by variable name. 

rawDF_num_X <- cbind(rawDF2bin$ID, 
                 rep(var_num, times = nrow(rawDF2bin[, vars_num_raw][var_num])), 
                 rawDF2bin[, vars_num_raw][var_num])
colnames(rawDF_num_X) <- c("ID", "Variable", "Value")
rawDF_num_X <- as.data.table(rawDF_num_X)

# Subset table with bins for numeric variables by variable name. 

bins_num_X <- bins_num[bins_num$Variable == var_num, ] 
bins_num_X <- arrange(bins_num_X, Bin) # sort by bin values, in ascending order. 
bins_num_X <- as.data.table(bins_num_X)

# Select and join numeric variables with their corresponding bins using sqldf package. 

vars_num_join <- sqldf("SELECT a.ID, a.Variable, a.Value, b.Min, b.Max, b.Bin 
                        FROM rawDF_num_X AS a, bins_num_X AS b 
                        WHERE a.Variable = b.Variable 
                        AND (a.Value BETWEEN b.Min AND b.Max
                             OR (a.Value IS NULL AND b.Min IS NULL AND b.Max IS NULL))") 
View(vars_num_join); dim(vars_num_join)

# Create a TRUE/FALSE flag/check according to the binning conditions. 

vars_num_join$check <- with(vars_num_join, 
                            ifelse(is.na(Value) & is.na(Min) & is.na(Max), TRUE,       # NA bin
                                   ifelse(Value == 0 & Min == 0 & Max == 0, TRUE,      # zero bin
                                          Value != 0 & Value >= Min & Value < Max)))   # [Min, Max) bins

# Remove (duplicate) rows that have FALSE flag due to not matching the binning conditions. 

vars_num_join <- vars_num_join[vars_num_join$check == TRUE, ]
identical(rawDF2bin$ID, vars_num_join$ID) # should be TRUE
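As a possible alternative to the sqldf join plus the TRUE/FALSE check, the interval lookup could be sketched as a data.table non-equi update join. This is a sketch under assumptions, not a tested solution for the real data: `raw_toy` and `bins_toy` are made-up tables, only the interval bins (2-6) take part in the join, and the zero and NA bins are assigned afterwards, since those are the cases a range join does not handle on its own.

```r
library(data.table)

# Toy stand-ins (made up): raw values and the interval bins 2-6 only.
raw_toy <- data.table(ID = 1:6,
                      Value = c(0, -20, 1500, 9000, 12000, NA))
bins_toy <- data.table(Min = c(-Inf, 1500, 4000, 8000, 12000),
                       Max = c(1500, 4000, 8000, 12000, Inf),
                       Bin = 2:6)

# Non-equi update join: for each bin row, tag the raw rows with Min <= Value < Max.
raw_toy[bins_toy, on = .(Value >= Min, Value < Max), Bin := i.Bin]

# Special cases handled explicitly: zeros go to bin 1, NAs to bin 7.
raw_toy[Value == 0, Bin := 1L]
raw_toy[is.na(Value), Bin := 7L]

raw_toy
```

Because the join updates `raw_toy` by reference, no duplicate rows are produced and the row order is unchanged, so the flag-and-filter step is not needed in this variant.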