library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
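# stringsAsFactors = TRUE keeps V1/V2 as factors (needed for droplevels() on R >= 4.0)
df <- data.frame(V1, V2, cor, stringsAsFactors = TRUE)
# exclude rows where cor is NA
df <- df[complete.cases(df), ]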
This is the complete data frame; cor = NA means the correlation is below 0.8.
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
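In the df above, F does not appear in V1, which means F is not of interest.
So here I remove the rows where V2 = F (more generally, where V2 equals a value that is not in V1).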
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So now df.gp is the data set I need to work with.
I drop the unused levels in V2 (F in this example).
df.gp$V2 <- droplevels(df.gp$V2)
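I do not want to exclude the self-correlations: if some value of V1 is not correlated with any other variable, I want it placed in a group of its own.
Looking at cor, A and B are correlated, C and D are correlated, and E forms a group by itself.
So the example here should yield three groups.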
1 Answer
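The way I see it, working with the data directly as a data.frame may get complicated, so I took the liberty of converting it back to a matrix. Once I have the correlation matrix, it is easy to see which variables share the same indices of non-NA values.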
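A minimal sketch of the matrix idea described above, written under assumptions: df.gp is rebuilt by hand from the question's output, and the names vars, m, sig and groups are illustrative only, not taken from the answer. Each variable is labelled by the set of variables it has a non-NA correlation with, and variables with identical sets go into the same group.

```r
# df.gp as printed in the question, rebuilt by hand so the sketch is self-contained.
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1),
  stringsAsFactors = FALSE
)

vars <- sort(unique(df.gp$V1))

# Square correlation matrix; pairs that were filtered out stay NA.
m <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
m[cbind(match(df.gp$V1, vars), match(df.gp$V2, vars))] <- df.gp$cor

# For each row, record which variables have a non-NA correlation with it.
sig <- apply(!is.na(m), 1, function(r) paste(vars[r], collapse = ","))

# Variables with the same set of non-NA partners share a group number.
groups <- match(sig, unique(sig))
names(groups) <- vars
groups
```

This relies on every member of a group having exactly the same non-NA pattern, which holds for the example data. If A correlated with B and B with C, but A not with C, the three rows would get different patterns and the sketch would split them.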
Now X1 or X2 identifies your unique groupings.
Edit by cyrusjan:
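Assuming we have already selected the rows where cor >= a, the script above is one possible solution, where a is the threshold, taken to be 0.8 in the question above.
Contribution from alexis_laz: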
By using cutree and hclust, we can set the threshold (i.e. h = 0.8) directly in the script.
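One way the hclust and cutree approach could look, as a sketch under stated assumptions rather than the contributor's actual script: correlations are turned into distances as 1 - cor, pairs missing from df.gp are treated as correlation 0, single linkage is used, and the tree is cut at 1 - threshold (plus a small tolerance for floating-point comparison). The names threshold, sim, d, hc and groups are illustrative.

```r
# df.gp as printed in the question, rebuilt by hand so the sketch is self-contained.
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1),
  stringsAsFactors = FALSE
)

threshold <- 0.8  # assumed: the same cut-off used to filter the rows

vars <- sort(unique(c(df.gp$V1, df.gp$V2)))

# Similarity matrix; pairs absent from df.gp are assumed to have correlation 0.
sim <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
sim[cbind(match(df.gp$V1, vars), match(df.gp$V2, vars))] <- df.gp$cor

# Convert similarity to distance: 0 for identical, 1 for unrelated.
d <- as.dist(1 - sim)

# Single linkage joins clusters through their closest pair of members.
hc <- hclust(d, method = "single")

# Cut just above 1 - threshold so pairs with cor >= threshold stay together.
groups <- cutree(hc, h = (1 - threshold) + 1e-8)
groups
```

With single linkage, variables end up in one group whenever a chain of correlations at or above the threshold connects them; complete linkage would instead require every pair inside a group to meet the threshold.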