R data.frame stream-data preprocessing for aggregated time statistics

Time: 2014-05-15 11:38:12

Tags: r awk dataframe aggregate


> df <- data.frame(amount=c(4,3,1,1,4,5,9,13,1,1), size=c(164,124,131,315,1128,331,1135,13589,164,68), tot=1, first=c(1,1,3,3,2,2,2,2,4,4), secs=c(2,2,0,0,1,1,1,1,0,0))
> df
  amount  size   tot first secs
1      4   164     1     1    2
2      3   124     1     1    2
3      1   131     1     3    0
4      1   315     1     3    0
5      4  1128     1     2    1
6      5   331     1     2    1
7      9  1135     1     2    1
8     13 13589     1     2    1
9      1   164     1     4    0
10     1    68     1     4    0


> df2
  time tot amount  size
1    1   2    3.5   144
2    2   6   34.5 16327
3    3   8   36.5 16773
4    4   2    2.0   232



# Pre-declare the columns so that df2[row, col] lookups return NA instead of failing
df2 <- data.frame(time = numeric(0), tot = numeric(0), amount = numeric(0), size = numeric(0))
for (i in 1:nrow(df)) {

  items <- df[i, 'secs']
  idd <- df[i, 'first']

  for (ss in 0:items) {  # 0:items is evaluated once, so secs=0 still gets a single pass
    if (items == 0) { items <- 1 }  # avoid dividing by zero below

    df2[idd+ss, 'time'] <- idd+ss

    if (is.null(df2[idd+ss, 'tot']) || is.na(df2[idd+ss, 'tot'])) {
      df2[idd+ss, 'tot'] <- df[i, 'tot']
    } else {
      df2[idd+ss, 'tot'] <- df2[idd+ss, 'tot'] + df[i, 'tot']
    }

    if (is.null(df2[idd+ss, 'amount']) || is.na(df2[idd+ss, 'amount'])) {
      df2[idd+ss, 'amount'] <- df[i, 'amount']/items
    } else {
      df2[idd+ss, 'amount'] <- df2[idd+ss, 'amount'] + df[i, 'amount']/items
    }

    if (is.null(df2[idd+ss, 'size']) || is.na(df2[idd+ss, 'size'])) {
      df2[idd+ss, 'size'] <- df[i, 'size']/items
    } else {
      df2[idd+ss, 'size'] <- df2[idd+ss, 'size'] + df[i, 'size']/items
    }
  }
}


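The loop above could be optimized to get good performance, but I bet a better algorithm exists. Maybe you could expand/duplicate the rows with secs > 0, incrementing the first (timestamp) value of each expanded row and adjusting the amount, size and tot metrics on the fly:

Now the original data...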

  amount  size   tot first secs
1      4   164     1     1    0
2      4   164     1     1    1
3      3   124     1     1    2

...magically becomes

  amount  size   tot first
1      4   164     1     1
2      2    82     1     1
3      2    82     1     2
4      1 41.33     1     1
5      1 41.33     1     2
6      1 41.33     1     3
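
The expansion illustrated above could be sketched in base R roughly as follows. This is only a sketch: the names orig, n and out are made up for illustration, and it follows the illustration's rule of splitting each row evenly across its secs + 1 copies, which is not the same divisor the loop uses.

```r
# The three illustrative rows from the "original data" table above
orig <- data.frame(amount = c(4, 4, 3), size = c(164, 164, 124),
                   tot = 1, first = 1, secs = c(0, 1, 2))

# Number of copies per row: one per time slot the row covers
n <- orig$secs + 1

# Duplicate each row n times and drop the secs column
out <- orig[rep(seq_len(nrow(orig)), n), c("amount", "size", "tot", "first")]

# Split amount and size evenly across the copies of each row
out$amount <- out$amount / rep(n, n)
out$size   <- out$size / rep(n, n)

# Shift the timestamp of each copy by its offset 0, 1, ..., secs
out$first <- out$first + sequence(n) - 1

rownames(out) <- NULL
print(out)
```
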

After that preprocessing step, aggregating with plyr's ddply would be trivial, in its efficient parallel mode of course.
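
As a rough end-to-end sketch of that idea on the full df, the expansion and the aggregation could look like the following in base R. The names idx, expanded, divisor and result are assumptions introduced here, base aggregate stands in for ddply to avoid a package dependency, and the divisor mirrors the loop (secs, with 0 treated as 1) so that the output lines up with the expected df2.

```r
df <- data.frame(
  amount = c(4, 3, 1, 1, 4, 5, 9, 13, 1, 1),
  size   = c(164, 124, 131, 315, 1128, 331, 1135, 13589, 164, 68),
  tot    = 1,
  first  = c(1, 1, 3, 3, 2, 2, 2, 2, 4, 4),
  secs   = c(2, 2, 0, 0, 1, 1, 1, 1, 0, 0)
)

# Repeat each row secs + 1 times: one copy per time slot it covers
idx <- rep(seq_len(nrow(df)), df$secs + 1)
expanded <- df[idx, ]

# Time slot of each copy: first + 0, first + 1, ..., first + secs
expanded$time <- expanded$first + sequence(df$secs + 1) - 1

# Same divisor as the loop: secs, with secs == 0 treated as 1
divisor <- pmax(expanded$secs, 1)
expanded$amount <- expanded$amount / divisor
expanded$size   <- expanded$size / divisor

# Sum the three metrics per time slot
result <- aggregate(cbind(tot, amount, size) ~ time, data = expanded, FUN = sum)
print(result)
```

With plyr loaded, the last step should be replaceable by ddply(expanded, "time", summarise, tot = sum(tot), amount = sum(amount), size = sum(size)).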



1 Answer:

Answer 0: (score: 0)



library(data.table)
dt <- as.data.table(df)

# Using the "expand" solution linked in the Q. 
# +1 to secs to allow room for 0-values
dtr <- dt[rep(seq.int(1, nrow(dt)), secs+1)] 

# Create a new seci column that enumerates sec for each row of dt
dtr[,seci := dt[,seq(0,secs),by=1:nrow(dt)][,V1]]

# All secs that equal 0 are changed to 1 for later division
dtr[secs==0, secs := 1]

# Create time (first+seci) and adjusted amount and size columns
dtr[,c("time", "amount2", "size2") := list(first+seci, amount/secs, size/secs)]

# Aggregate selected columns (tot, amount2, and size2) by time
dtr.a <- dtr[,list(tot=sum(tot), amount=sum(amount2), size=sum(size2)), by=time]

> dtr.a
   time tot amount  size
1:    1   2    3.5   144
2:    2   6   34.5 16327
3:    3   8   36.5 16773
4:    4   2    2.0   232