根据回归线的残差距离对一系列变量着色散点图的策略-Java 学习之路

我有一个数据框，我想在其中针对第一列迭代地绘制每列 . 我想制作一个sactterplot并根据它们与回归线的距离对点进行着色 .

head(data)
       data nucleosome  H3K27me3   H3K9me3
[1,] -20000   4.612515 0.3502755 0.4066719
[2,] -19999   4.619391 0.3500934 0.4070110
[3,] -19998   4.622314 0.3496383 0.4066719
[4,] -19997   4.619391 0.3497293 0.4060786
[5,] -19996   4.618532 0.3490921 0.4049767
[6,] -19995   4.620423 0.3490011 0.4047225

基于以下链接：https://www.r-bloggers.com/visualising-residuals/我试过这个并实现了以下目标：
enter image description here

这样做如下：

for(i in seq(2,ncol(data))){
  print(colnames(data)[i])
  fit=lm(paste0('heterochromatin~', colnames(data)[i]), data=as.data.frame(data))
  #print(ggplotRegression(fit))
  g=ggplot(data = as.data.frame(data),aes_string(x=colnames(data)[i], y='nucleosome'))+
  geom_point(aes(color=abs(residuals(fit))))+
  geom_smooth(method=lm, col="blue")+
  scale_color_continuous(low = "red", high = "black")+ labs(color='Residual distance') +
  #annotate("text", x = min(data[,i]), y=max(nucleosome), label = paste("R^2 is", format(summary(fit)$adj.r.squared, digits=3)))+
  geom_text(label = paste("R^2 is", format(summary(fit)$adj.r.squared, digits=3)), x = min(data[,i]), y=max(nucleosome), hjust='inward')+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=14))
  #legend("topleft", bty="n", legend=paste("R2 is", format(summary(fit)$adj.r.squared, digits=3)))
  png(paste0(colnames(data)[i],'_enhanced_ggplot.png'))
  print(g)
  dev.off()
}

但是，我更倾向于将残差分解为零，以便与回归线的距离更加明显，如下所示：

enter image description here

for(i in seq(2,ncol(data))){
  print(colnames(data)[i])
  fit=lm(paste0('heterochromatin~', colnames(data)[i]), data=as.data.frame(data))
  cols=kmeans(abs(residuals(fit)), centers = 4)$cluster
  g=ggplot(data = as.data.frame(data),aes_string(x=colnames(data)[i], y='nucleosome'))+
  geom_point(aes(color=cols))+
  geom_smooth(method=lm, col="blue")+
  scale_color_continuous()+labs(color='Residual distance') 
  annotate("text", x = min(data[,i]), y=max(nucleosome), label = paste("R^2 is", format(summary(fit)$adj.r.squared, digits=3)))
  #legend("topleft", bty="n", legend=paste("R2 is", format(summary(fit)$adj.r.squared, digits=3)))
  png(paste(i,'regression.png'))
  print(g)
  dev.off()
}

要做到这一点，我使用k-means来分隔剩余距离：

cols=kmeans(abs(residuals(fit)), centers = 4)$cluster

但正如你可以从上面看到的那样，没有正确编制索引，最接近该线的点应该是最黑的，如图所示 . 这些情节之间的着色也是不一致的..对于一些最轻的点最接近回归线，对于其他人而言....

我怎样才能获得kmeans /其他一些机制来正确地着色图？

我也试过了

rbPal <- colorRampPalette(c('red','blue'))
cols<- rbPal(10)[as.numeric(cut(abs(residuals(fit)),breaks = 10))]

但我得到错误：

Error: Discrete value supplied to continuous scale

1 回答

您可以使用模型对象中的残差来获取每个点与回归线的距离 . 例如：

library(tidyverse)

m1 = lm(mpg ~ wt, data=mtcars)

ggplot(mtcars %>% mutate(resid=abs(resid(m1)),
                         fitted=fitted(m1))) +
  geom_line(aes(wt, fitted)) + 
  geom_point(aes(wt, mpg, colour=resid)) +
  scale_colour_gradient(low="blue", high="red") +
  theme_classic() +
  labs(x="Weight", y="MPG", colour="Residuals")

enter image description here

如果您想要离散颜色，请将残差转换为一个因子：

ggplot(mtcars %>% mutate(resid=cut(abs(resid(m1)), 4),
                         fitted=fitted(m1))) +
  geom_line(aes(wt, fitted)) + 
  geom_point(aes(wt, mpg, colour=resid)) +
  scale_colour_manual(values=hcl(0,100,seq(70,20,len=4))) +
  theme_classic() +
  labs(x="Weight", y="MPG", colour="Residuals")

enter image description here

要对数据框中的每一列执行此操作，我们将使用 map 为每列运行相同的代码 . 下面代码的输出是一个列表，其中每个列表元素是 mpg 与 mtcars 的每一列的回归结果的关系图：

plot.list = names(mtcars)[-grep("mpg", names(mtcars))] %>%
  map(function(var) { 

    m1 = lm(paste0("mpg ~", var) , data=mtcars)

    ggplot(mtcars %>% mutate(resid=cut(abs(resid(m1)), 4),
                             fitted=fitted(m1))) +
      geom_line(aes_string(var, "fitted")) + 
      geom_point(aes_string(var, "mpg", colour="resid")) +
      scale_colour_manual(values=hcl(0,100,seq(70,20,len=4))) +
      theme_classic() +
      labs(x=var, y="MPG", colour="Residuals")
  })

如果您希望在所有绘图中使用稳定的颜色映射（即，给定的残差值始终映射到相同的颜色），则需要在所有回归中找到最大残差值 . 例如：

# Find largest residual value among all regressions
max.resid = names(mtcars)[-grep("mpg", names(mtcars))] %>%
  map_dbl(~ max(resid(lm(paste0("mpg ~", .x) , data=mtcars)))) %>% max

现在我们在设置 breaks 时使用 max.resid ，当我们将 resid 变成一个因子时：

plot.list = names(mtcars)[-grep("mpg", names(mtcars))] %>%
  map(function(var) { 

    m1 = lm(paste0("mpg ~", var) , data=mtcars)

    ggplot(mtcars %>% mutate(resid=cut(abs(resid(m1)), breaks=seq(0, max.resid, length=5)),
                             fitted=fitted(m1))) +
      geom_line(aes_string(var, "fitted")) + 
      geom_point(aes_string(var, "mpg", colour="resid")) +
      scale_colour_manual(values=hcl(0,100,seq(70,20,len=6)), drop=FALSE) +
      theme_classic() +
      labs(x=var, y="MPG", colour="Residuals")
  })

回复于 2024-05-02T15:11:08+08:00

根据回归线的残差距离对一系列变量着色散点图的策略

1 回答

相关问题