How to preserve stack order in ggplot with stat = 'identity'?

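When plotting a stacked bar chart with stat='identity', I noticed that the order of the stacks differs from bar to bar, seemingly at random. Here is the plot using stat='bin' (the geom_bar default at the time; since ggplot2 2.0.0 it is stat='count'), i.e. ggplot counts the number of elements in each category on the fly before plotting (data.table is needed later):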

library(ggplot2)
library(data.table)

diamonds <- data.table(diamonds)

ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar(position="fill")


Within each bar, the order of 'cut' follows the factor level order. However, if I summarise before plotting and use stat='identity', this order is lost:

diamonds_sum <- diamonds[, list(.N), by=list(cut, clarity)]
ggplot(diamonds_sum, aes(clarity, y=N, fill = cut)) + geom_bar(stat="identity", position="fill")


This happens even though the level order is the same in both tables:

levels(diamonds_sum$cut) == levels(diamonds$cut)
[1] TRUE TRUE TRUE TRUE TRUE

So the question is twofold: (i) why is the stack order different, and (ii) how can it be fixed?

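The simple solution is of course to always use stat='bin', but my real dataset has several million entries, and summarising first and then plotting is faster.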

1 Answer

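The reason is that after summarising, the rows of the new table are no longer sorted: data.table's by= returns the groups in the order in which they first appear in the data. ggplot2 takes the row order as given when it stacks the bars. Compare, for example, the output of the following two approaches (only the first 10 rows are shown, as they illustrate the difference well enough):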

> diamonds[, .N, by=.(cut, clarity)]
          cut clarity    N
 1:     Ideal     SI2 2598
 2:   Premium     SI1 3575
 3:      Good     VS1  648
 4:   Premium     VS2 3357
 5:      Good     SI2 1081
 6: Very Good    VVS2 1235
 7: Very Good    VVS1  789
 8: Very Good     SI1 3240
 9:      Fair     VS2  261
10: Very Good     VS1 1775

> diamonds[, .N, by=.(cut, clarity)][order(clarity,cut)]
          cut clarity    N
 1:      Fair      I1  210
 2:      Good      I1   96
 3: Very Good      I1   84
 4:   Premium      I1  205
 5:     Ideal      I1  146
 6:      Fair     SI2  466
 7:      Good     SI2 1081
 8: Very Good     SI2 2100
 9:   Premium     SI2 2949
10:     Ideal     SI2 2598

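As you can see, the original code yields rows in a mixed order, whereas the second approach yields rows sorted by clarity and then by cut. So when you do: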

diamonds_sum <- diamonds[, .N, by=.(cut, clarity)][order(clarity,cut)]

and then plot with:

ggplot(diamonds_sum, aes(clarity, y=N, fill = cut)) + 
  geom_bar(stat="identity", position="fill")

you get the desired result.

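As a possible alternative to chaining order(), data.table can also sort the summary while grouping (keyby=) or reorder an existing table by reference (setorder()). The following is a minimal sketch, assuming the diamonds data from ggplot2; the names dt, unsorted, sorted_keyby and sorted_inplace are only illustrative and do not appear in the original answer.

```r
library(data.table)
library(ggplot2)  # only needed here for the diamonds dataset

dt <- data.table(diamonds)

# by= keeps the groups in order of first appearance in the data
unsorted <- dt[, .N, by = .(cut, clarity)]

# keyby= groups and then sorts the result by the grouping columns;
# factor columns are sorted by their level order
sorted_keyby <- dt[, .N, keyby = .(clarity, cut)]

# setorder() sorts an existing table by reference (no copy is made),
# so work on a copy here to keep `unsorted` unchanged
sorted_inplace <- copy(unsorted)
setorder(sorted_inplace, clarity, cut)
```

Either sorted table can be passed to the same ggplot() call with y = N. Avoiding an extra copy may matter for a summary of a table with millions of rows, although the summary itself is small.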

dplyr behaves in a similar way: you need arrange to get the rows into the right order. Compare the output of the following two:

library(dplyr)

diamonds %>% group_by(cut, clarity) %>% tally()
diamonds %>% group_by(cut, clarity) %>% tally() %>% arrange(clarity,cut)
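For completeness, here is a sketch of the whole dplyr route from summary to plot. It uses count() as a shorthand for group_by() plus tally(), which is a substitution on my part rather than the code above; note that the count column is then called n, not N. The name p is illustrative.

```r
library(dplyr)
library(ggplot2)

diamonds_sum <- diamonds %>%
  count(clarity, cut) %>%   # one row per combination, count stored in column n
  arrange(clarity, cut)     # make the row order explicit: clarity, then cut

# Build the plot from the pre-summarised table; print(p) draws it
p <- ggplot(diamonds_sum, aes(clarity, y = n, fill = cut)) +
  geom_bar(stat = "identity", position = "fill")
```

The explicit arrange() may be redundant when the columns are already listed in this order in count(), but it documents the row order the plot relies on.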

Summarising with base R does not cause the problem you describe. When you do:

diamonds_sum <- aggregate(diamonds[,"cut",with=FALSE], list(diamonds$cut,diamonds$clarity), length)

and then plot with:

ggplot(diamonds_sum, aes(Group.2, y=cut, fill = Group.1)) + 
  geom_bar(stat="identity", position="fill")

you get the correct result.

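Another base R option, shown here only as a sketch and not taken from the original answer, is table() followed by as.data.frame(). It keeps readable column names instead of Group.1 and Group.2, and the rows come out with the first dimension (cut) varying fastest, that is, ordered by clarity and then by cut. One caveat: every combination of levels gets a row, including combinations with a count of zero.

```r
library(ggplot2)  # for the diamonds dataset and the plot

# Count every cut x clarity combination, then flatten the table;
# responseName sets the name of the count column
diamonds_sum <- as.data.frame(
  table(cut = diamonds$cut, clarity = diamonds$clarity),
  responseName = "N"
)

# Build the plot from the pre-summarised table; print(p) draws it
p <- ggplot(diamonds_sum, aes(clarity, y = N, fill = cut)) +
  geom_bar(stat = "identity", position = "fill")
```

The resulting columns are plain (unordered) factors with the same level order as in diamonds, so the default fill palette may differ from the one used for ordered factors.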

Answered 2024-04-29T22:10:04+08:00
