Plotting by group in data.table

Asked: 2015-02-08 23:06:55

Tags: r data.table

I have individual-level data, and I'm trying to dynamically summarize the outcome by group.

Example:

library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))

I'd like to know the simplest way to plot the group-level summary, the data for which is contained in:

DT[ , mean(outcome), by = .(grp, time)]

I'd like something like:

DT[ , plot(mean(outcome)), by = .(grp, time)]

But this doesn't work at all.

The workable option I've been getting by with (which could easily be looped) is:

plot(DT[grp == "a", mean(outcome), by = time])
lines(DT[grp == "b", mean(outcome), by = time])
lines(DT[grp == "c", mean(outcome), by = time])
lines(DT[grp == "d", mean(outcome), by = time])

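(with parameters for colors etc. added in practice, excluded here for brevity)

This strikes me as not the best way to do it. Given data.table's skill at handling groups, isn't there a more elegant solution?

Other sources have been pointing me to matplot, but I can't see an easy way to use it. Do I need to reshape DT, and is there a simple {{1} that would get the job done?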
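For reference, the looped version of the approach above might look like the following sketch. The names DT_m, grps, and the column m are introduced here for illustration only, and the colors are simply the first four entries of the default palette.

```r
library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))

# Aggregate once so the axis limits can cover every group
DT_m <- DT[ , .(m = mean(outcome)), by = .(grp, time)]
grps <- unique(DT_m$grp)

# Empty canvas sized from the full set of group means
plot(NULL, xlim = range(DT_m$time), ylim = range(DT_m$m),
     xlab = "time", ylab = "mean outcome")

# One line per group; k indexes into the default color palette
for (k in seq_along(grps)) {
  with(DT_m[grp == grps[k]], lines(time, m, col = k))
}
```

Setting the limits from the aggregated table avoids the clipping that can happen when the first group's plot() call fixes the y-range.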

4 Answers:

Answer 0 (score: 4)

You're on the right track. Use ggplot to do this, as follows:

(dt_agg <- DT[ , .(mean = mean(outcome)), by = list(grp, time)]) # aggregated data.table
     grp time        mean
  1:   a    1  0.75865672
  2:   a    2  0.07244879
 ---

Now ggplot this aggregated data.table:

library(ggplot2)
ggplot(dt_agg, aes(x = time, y = mean, col = grp)) + geom_line()

Result: [plot image not preserved]

Answer 1 (score: 4)

Using matplot with dcast:

dt_agg <- DT[ , .(mean = mean(outcome)), by = .(grp, time)]
# Reshape to wide: one row per time, one column per group
dt_cast <- dcast(dt_agg, time ~ grp, value.var = "mean")
dt_cast[ , matplot(time, .SD[ , !"time", with = FALSE],
                   type = "l", ylab = "mean", xlab = "")]
# or, if you've got data.table version 1.9.7+:
#   (see https://github.com/Rdatatable/data.table/wiki/Installation)
dt_cast[ , matplot(time, .SD, type = "l", ylab = "mean", xlab = ""), .SDcols = !"time"]

Result: [plot image not preserved]
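The reshape is needed because matplot draws one line per column of its y argument, so each group has to sit in its own column. A minimal sketch of what the wide table looks like, rebuilding the question's example data (dt_wide is a name chosen here for illustration):

```r
library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))

# Long format: one row per (grp, time) pair
dt_agg <- DT[ , .(mean = mean(outcome)), by = .(grp, time)]

# Wide format: one row per time point, one column per group
dt_wide <- dcast(dt_agg, time ~ grp, value.var = "mean")

dim(dt_wide)    # 50 rows, 5 columns
names(dt_wide)  # "time" followed by the four group labels
```

Every column other than time then becomes one line in the matplot call.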

Answer 2 (score: 4)

There is a way to do this with data.table's by argument as follows:

DT[ , mean(outcome), by = .(grp, time)
    ][ , {plot(NULL, xlim = range(time),
           ylim = range(V1)); .SD}
       ][ , lines(time, V1, col = .GRP), by = grp]

Note that the intermediate {...; .SD} part is necessary to continue chaining. If DT[ , mean(outcome), by = .(grp, time)] were already stored as another data.table, DT_m, then we could just do:

DT_m[ , plot(NULL, xlim = range(time), ylim = range(V1))]
DT_m[ , lines(time, V1, col = .GRP), by = grp]
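To illustrate why the trailing .SD matters: plot() returns NULL, so a j expression that ends with plot() makes the whole [ call return NULL, and there is nothing left to chain on. A small sketch with made-up numbers (the toy DT_m, res_plain, and res_sd below are illustrative names and values, not from the example data):

```r
library(data.table)

# Toy stand-in for the aggregated table; the values are arbitrary
DT_m <- data.table(grp  = rep(c("a", "b"), each = 3),
                   time = rep(1:3, 2),
                   V1   = c(0.1, 0.4, 0.2, -0.3, 0.0, 0.5))

# j ends with plot(), so the result of the [ call should be NULL
res_plain <- DT_m[ , plot(NULL, xlim = range(time), ylim = range(V1))]
is.null(res_plain)

# Ending j with .SD hands the data back, so another [ ... ] can follow
res_sd <- DT_m[ , {plot(NULL, xlim = range(time), ylim = range(V1)); .SD}]
is.data.table(res_sd)
```

Because res_sd is still a data.table, a further [ , lines(time, V1, col = .GRP), by = grp] can be chained onto it, which is what the one-liner above relies on.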

With output: [plot image not preserved]

Much fancier results are possible; for example, if we wanted to assign a specific color to each group:

grp_col <- c(a = "blue", b = "black",
             c = "darkgreen", d = "red")
DT[ , mean(outcome), by = .(grp, time)
    ][ , {plot(NULL, xlim = range(time),
           ylim = range(V1)); .SD}
       ][ , lines(time, V1, col = grp_col[.BY$grp]), by = grp]

NOTE

There is a bug in RStudio which will cause this code to fail if the output is sent to the RStudio graphics device. As such this approach only works from R on the command line or from sending the output to an external device (I sent it to png to produce the above).

See data.table issue #1524, this RStudio support ticket, and these SO Qs (1 and 2)

Answer 3 (score: 0)

Using reshape2, you can transform your dataset into one containing the group means, as follows:

new_dt <- dcast(DT, time ~ grp, value.var = 'outcome', fun.aggregate = mean)

new_dt_molten <- melt(new_dt, id.vars = 'time')

Then plot it with ggplot2, as follows:

ggplot(new_dt_molten, aes(x = time, y = value, colour = variable)) + geom_line()

Alternatively (and this is actually the simpler solution), you can use the dataset you already have directly and do the following:

ggplot(DT, aes(x = time, y = outcome, colour = grp)) + geom_jitter() + geom_smooth(method = 'loess')

ggplot(DT, aes(x = time, y = outcome, colour = grp)) + geom_smooth(method = 'loess')