Question

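I'm a social science researcher working on visualizing how people progress through various roles in a community.

I've clustered people's monthly behavior into role categories, and now I want to visualize the number and proportion of people in each role in each (relative) time period.

Currently, the data is in a CSV file that looks like this: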

ID  T1  T2  T3 ...
1   2   2   3
2   1   0   2
3   1   2   1
...

where X(ij) is the ID of the cluster that person i was in during month j.

What I want is something like this, which I created in LibreOffice: [image]

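I believe I need to use ggplot2, but I've been struggling to figure out how to get the data into a format that ggplot likes.

I think my first task is to sum up each cluster in each time period? Is there an easy way to do that?

I can do it with the following code, but it is ugly and messy, and surely there is a better way?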

clus1 <- apply(clusters, 2, function(x) {sum(x=='1', na.rm=TRUE)})
clus2 <- apply(clusters, 2, function(x) {sum(x=='2', na.rm=TRUE)})
clus3 <- apply(clusters, 2, function(x) {sum(x=='3', na.rm=TRUE)})
clus0 <- apply(clusters, 2, function(x) {sum(x=='0', na.rm=TRUE)})
clusters2 <- data.frame(clus0, clus1, clus2, clus3)
c2 <- t(clusters2)
c3 <- as.data.frame(c2)
c3$id = c('Low Activity Cluster', 'Cluster 1', 'Cluster 2', 'Cluster 3')
c3 <- c3[order(c3$'id'),]
print(ggplot(melt(c3, id.vars="id")) +
  geom_area(aes(x=variable, y=value, fill=id, group=id), position="fill"))

For the sample data, this produces something like:

id                      T1  T2  T3
Low Activity Cluster     0   1   0
Cluster 1                2   0   1
Cluster 2                1   2   1
Cluster 3                0   0   1

Is this the right strategy?

Answer 1

Edit, to address the comment:

`rownames<-`(
  as.data.frame(lapply(df[-1], function(x) as.numeric(table(x)))), 
  paste("Clust ", 0:3)
)

Produces:

         T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
Clust  0  4  3  5  8 11  6  2  4  5   7
Clust  1  5  9  8  6  3  7  7  8  7   4
Clust  2  5  6  2  3  3  3  2  3  4   4
Clust  3  6  2  5  3  3  4  9  5  4   5

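This uses table to count the occurrences of each cluster type (0:3) in each time period. The key piece of code is lapply(...). The code around it only makes the display pretty.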

Using this data:

set.seed(1)
labels <- paste("Clust ", 0:3)
df <- as.data.frame(c(
  list(ID = 1:20),
  setNames(
    replicate(10, factor(sample(0:3, 20, replace = TRUE)), simplify = FALSE),
    paste0("T", 1:10)
  )
))
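
One thing to watch with as.numeric(table(x)): if a cluster has no members in some month, that column's table comes back shorter and the columns no longer line up. The following is a minimal base R sketch of one way around that, assuming the full set of cluster IDs is known to be 0:3. It uses the three-person sample from the question; the names clusters, cluster_ids, counts and props are only illustrative.

```r
# Sample data in the question's layout: one row per person, one column per month.
clusters <- data.frame(
  ID = 1:3,
  T1 = c(2, 1, 1),
  T2 = c(2, 0, 2),
  T3 = c(3, 2, 1)
)

cluster_ids <- 0:3  # assumption: the complete set of cluster IDs

# Count people per cluster in each month. Fixing the factor levels keeps
# clusters with zero members in the result.
counts <- sapply(clusters[-1], function(x) {
  as.vector(table(factor(x, levels = cluster_ids)))
})
rownames(counts) <- paste("Cluster", cluster_ids)

# Convert counts to within-month proportions (each column sums to 1).
props <- prop.table(counts, margin = 2)

print(counts)
print(props)
```

The counts matrix here matches the small table shown in the question, and props holds the share of people in each cluster per month.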

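Here is a ggplot solution (the data it runs on is shown at the end of this answer). First, you need to convert your data to long format using melt from the reshape2 package, then you can aggregate it (optionally recasting it to wide format with dcast), and then plot it: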

library(reshape2)
library(ggplot2)
df.mlt <- melt(df, id.vars="ID")
df.agg <- aggregate(. ~ ID + variable, df.mlt, sum)
dcast(df.agg, ID ~ variable)  # just for show, we don't use the result anyplace

#   ID T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
# 1  0 25 18 29 23 16 15 14 22 29  19
# 2  1  7  7 14 18 19 11 21 17 15  22
# 3  2 16 15 16 20 23 20 16 13 15  12
# 4  3 14 13 20 17 25 14 13  7 21  24

ggplot(df.agg) +
  geom_area(aes(x=variable, y=value, fill=ID, group=ID), position="fill")


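ggplot takes a little getting used to, but once you are used to it, it becomes quite intuitive. You should first look at the result of melt(df, id.vars="ID") to see what "long format" means. Then, in this case, we use geom_area and specify in aes the "aesthetics" (values that change with the data): the x value (variable is a name generated by melt; here it holds the time values), the y value (value, also created by melt), and the fill color of the areas, which should come from ID. Note that because the time we use here is categorical (T1, T2, etc., rather than actual dates), we must use group in addition to fill so that ggplot knows you want to connect the points across the different times.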
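
To make "long format" concrete, here is a tiny sketch on a made-up two-row table (the data frame wide and its values are invented for illustration). Each row of the result holds one ID, one time label and one value:

```r
library(reshape2)

# A small wide table: one row per cluster ID, one column per time period.
wide <- data.frame(ID = factor(0:1), T1 = c(5, 3), T2 = c(2, 4))

# melt stacks the T1 and T2 columns into a 'variable' column (the former
# column names) and a 'value' column (the former cell contents).
long <- melt(wide, id.vars = "ID")
print(long)
```

The result has four rows and the columns ID, variable and value, which is the shape that aes(x=variable, y=value, fill=ID) expects.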

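Note that you don't need to perform the aggregation step before plotting; ggplot can handle it internally. The following command is equivalent (note that we use df.mlt here):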

# `fun` replaces the `fun.y` argument, which is deprecated as of ggplot2 3.3.0
ggplot(df.mlt) +
  stat_summary(aes(x=variable, y=value, fill=ID, group=ID), fun=sum, position="fill", geom="area")

Here is the data I used for the ggplot solution:

df <- as.data.frame(c(
  list(ID = rep(factor(0:3), 3)),
  setNames(
    replicate(10, sample(1:10, 12, replace = TRUE), simplify = FALSE),
    paste0("T", 1:10)
  )
))
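
The ggplot code above runs on data where ID is already the cluster. For the per-person layout in the question (one row per person, cluster IDs in the cells), the same idea might look like the sketch below. It assumes cluster IDs 0:3 and uses three months of randomly generated data; the names people, long and counts are illustrative, not from the original post.

```r
library(reshape2)
library(ggplot2)

set.seed(1)
# Made-up per-person data: 20 people, cluster ID (0 to 3) for each of 3 months.
people <- data.frame(
  ID = 1:20,
  T1 = sample(0:3, 20, replace = TRUE),
  T2 = sample(0:3, 20, replace = TRUE),
  T3 = sample(0:3, 20, replace = TRUE)
)

# Long format: one row per person per month.
long <- melt(people, id.vars = "ID",
             variable.name = "month", value.name = "cluster")
long$cluster <- factor(long$cluster, levels = 0:3)

# Count people per month and cluster; zero counts are kept.
counts <- as.data.frame(table(month = long$month, cluster = long$cluster))

# position = "fill" rescales each month to proportions.
p <- ggplot(counts, aes(x = month, y = Freq, fill = cluster, group = cluster)) +
  geom_area(position = "fill")
print(p)
```

Dropping position = "fill" (the default for geom_area is stacking) would show the raw number of people per cluster instead of the proportion.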
