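R - COUNTIFS function

Time: 2016-01-08 19:18:49

Tags: r

For each timestamp, I would like to count the number of trains/buses whose coordinate is higher than the reference value. See the dataset below: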

structure(list(timestamp = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), type = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
1L, 1L, 1L, 1L, 2L), .Label = c("Bus", "Reference", "Train"), class = "factor"), 
    X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L, 
    -56L, 489L, 412L, 350L, 278L, 970L, -65L, -894L, 780L, 265L, 
    -25L, 365L, 785L, 95L, 85L)), .Names = c("timestamp", "type", 
"X.coordinate"), row.names = c(NA, -22L), class = "data.frame")

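I would like to add 2 columns. The first should give, for each "Reference" row, the number of trains whose coordinate is higher than the reference value. The second should give the number of buses whose coordinate is higher than the reference value.

See the desired output below: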

   timestamp      type X.coordinate Train_with_higher_X Bus_with_higher_X
1          1     Train          470                  NA                NA
2          1     Train         -300                  NA                NA
3          1     Train           25                  NA                NA
4          1     Train          456                  NA                NA
5          1     Train           37                  NA                NA
6          1       Bus           19                  NA                NA
7          1       Bus          798                  NA                NA
8          1       Bus          -56                  NA                NA
9          1       Bus          489                  NA                NA
10         1       Bus          412                  NA                NA
11         1 Reference          350                   2                 3
12         2     Train          278                  NA                NA
13         2     Train          970                  NA                NA
14         2     Train          -65                  NA                NA
15         2     Train         -894                  NA                NA
16         2     Train          780                  NA                NA
17         2       Bus          265                  NA                NA
18         2       Bus          -25                  NA                NA
19         2       Bus          365                  NA                NA
20         2       Bus          785                  NA                NA
21         2       Bus           95                  NA                NA
22         2 Reference           85                   3                 4

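I tried to write some code in R to do this in an elegant way, but my programming skills are too limited at the moment. Does anyone know a solution to this problem? Thanks in advance!

3 answers:

Answer 0: (score: 1)

This should do the trick (using data.table):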

library(data.table)
setDT(DF)

DF[ , Train_with_higher_X :=
      sum(X.coordinate[type == "Train"] > 
            X.coordinate[type == "Reference"]), by = timestamp]

The same applies to Bus_with_higher_X.
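As a minimal sketch of the Bus counterpart, assuming the data.table package is installed and that the question's data is stored under the name DF (as in this answer), the Train line can be repeated with only the type changed. The data is rebuilt compactly here so the snippet runs on its own.

```r
library(data.table)

# The question's data, rebuilt compactly; DF is the name this answer uses.
DF <- data.table(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), 2)),
  X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L, -56L, 489L, 412L, 350L,
                   278L, 970L, -65L, -894L, 780L, 265L, -25L, 365L, 785L, 95L, 85L)
)

# Per timestamp, count the Bus rows whose coordinate exceeds the Reference row's.
DF[, Bus_with_higher_X :=
     sum(X.coordinate[type == "Bus"] > X.coordinate[type == "Reference"]),
   by = timestamp]

# Show the Reference rows with the new column.
print(DF[type == "Reference"])
```

At this point every row of a timestamp carries that timestamp's count, not only the Reference row.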

Keeping the value as NA for rows where type is not "Reference" is a bit more complicated, but if you are committed to it, fix it after the fact:

DF[type != "Reference", Train_with_higher_X := NA]

This approach also relies on there being exactly one row with type "Reference" for each value of timestamp.
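That assumption can be checked before running the grouped assignment. The following is only a sketch in base R, assuming the data is stored under the name DF; it counts the "Reference" rows per timestamp and stops if any timestamp has zero or several.

```r
# The question's data as a plain data frame; DF is an assumed name.
DF <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), 2)),
  X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L, -56L, 489L, 412L, 350L,
                   278L, 970L, -65L, -894L, 780L, 265L, -25L, 365L, 785L, 95L, 85L)
)

# Number of "Reference" rows for each timestamp.
ref_per_ts <- tapply(DF$type == "Reference", DF$timestamp, sum)
print(ref_per_ts)

# Fail early if a timestamp has no reference row or more than one.
if (any(ref_per_ts != 1)) {
  stop("Each timestamp needs exactly one 'Reference' row")
}
```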

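Answer 1: (score: 1)

As an alternative to mixing summary data and raw data in a single data frame (and therefore having lots of missing values in the summary columns), how about these options: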

library(dplyr)

Count the number of buses and trains greater than the reference, by timestamp:

dat %>% group_by(timestamp) %>%
  mutate(Reference = X.coordinate[type=="Reference"]) %>%
  filter(type != "Reference") %>%
  group_by(timestamp, type) %>%
  summarise(Reference = unique(Reference), 
            numGTref = sum(X.coordinate > Reference))

  timestamp   type Reference numGTref
1         1    Bus       350        3
2         1  Train       350        2
3         2    Bus        85        4
4         2  Train        85        3

Flag the buses and trains that are greater than the reference, by timestamp:

dat %>% group_by(timestamp) %>%
  mutate(Reference = X.coordinate[type=="Reference"]) %>%
  filter(type != "Reference") %>%
  group_by(timestamp, type) %>%
  mutate(Status = ifelse(X.coordinate > Reference, 
                         paste("Greater than", Reference), 
                         paste("Less than", Reference))) %>%
  select(-Reference)

   timestamp   type X.coordinate           Status
1          1  Train          470 Greater than 350
2          1  Train         -300    Less than 350
3          1  Train           25    Less than 350
4          1  Train          456 Greater than 350
5          1  Train           37    Less than 350
6          1    Bus           19    Less than 350
7          1    Bus          798 Greater than 350
8          1    Bus          -56    Less than 350
9          1    Bus          489 Greater than 350
10         1    Bus          412 Greater than 350
11         2  Train          278  Greater than 85
12         2  Train          970  Greater than 85
13         2  Train          -65     Less than 85
14         2  Train         -894     Less than 85
15         2  Train          780  Greater than 85
16         2    Bus          265  Greater than 85
17         2    Bus          -25     Less than 85
18         2    Bus          365  Greater than 85
19         2    Bus          785  Greater than 85
20         2    Bus           95  Greater than 85

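Answer 2: (score: 1)

Getting familiar with packages like data.table or dplyr, as the other answers suggest, is definitely a worthwhile investment of your time. But if you want to solve this with more familiar tools, that is possible too. You can do it without any extra packages by combining a few base R functions, mapply, ifelse and with, in the following lines of code (I have named your data frame d):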

# Count the buses at timestamp x whose X.coordinate exceeds the reference value y
bus_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Bus", ], sum(X.coordinate > y))
d$Bus_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)

# Same count for trains
train_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Train", ], sum(X.coordinate > y))
d$Train_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = train_sum, d$timestamp, d$X.coordinate), NA)

The result is pretty much what you wanted, I believe.

   timestamp      type X.coordinate Train_with_higher_X Bus_with_higher_X
1          1     Train          470                  NA                NA
2          1     Train         -300                  NA                NA
3          1     Train           25                  NA                NA
4          1     Train          456                  NA                NA
5          1     Train           37                  NA                NA
6          1       Bus           19                  NA                NA
7          1       Bus          798                  NA                NA
8          1       Bus          -56                  NA                NA
9          1       Bus          489                  NA                NA
10         1       Bus          412                  NA                NA
11         1 Reference          350                   2                 3
12         2     Train          278                  NA                NA
13         2     Train          970                  NA                NA
14         2     Train          -65                  NA                NA
15         2     Train         -894                  NA                NA
16         2     Train          780                  NA                NA
17         2       Bus          265                  NA                NA
18         2       Bus          -25                  NA                NA
19         2       Bus          365                  NA                NA
20         2       Bus          785                  NA                NA
21         2       Bus           95                  NA                NA
22         2 Reference           85                   3                 4

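The functions bus_sum and train_sum take a timestamp and a reference coordinate and count the rows of each type whose X.coordinate is greater than the reference coordinate (using with to evaluate the comparison only on the part of the data frame that matches that timestamp and type). The two functions differ only in d$type == "Bus" vs d$type == "Train", so it would be easy to generalize them with another function argument. Putting each function inside mapply calls it with the timestamp and X.coordinate of every row. The mapply call is then wrapped in ifelse simply to set all non-reference rows to NA.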
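One possible way to do that generalization, sketched in base R: the helper count_higher and its arguments vehicle and data are names made up for this illustration, and d is assumed to hold the question's data.

```r
# The question's data; d is the name used in this answer.
d <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), 2)),
  X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L, -56L, 489L, 412L, 350L,
                   278L, 970L, -65L, -894L, 780L, 265L, -25L, 365L, 785L, 95L, 85L)
)

# Hypothetical helper: count rows of the given vehicle type at timestamp ts
# whose X.coordinate is greater than ref.
count_higher <- function(ts, ref, vehicle, data) {
  rows <- data$timestamp == ts & data$type == vehicle
  sum(data$X.coordinate[rows] > ref)
}

for (vehicle in c("Train", "Bus")) {
  # Call the helper once per row, passing that row's timestamp and coordinate.
  counts <- mapply(count_higher, d$timestamp, d$X.coordinate,
                   MoreArgs = list(vehicle = vehicle, data = d))
  # Keep the count on Reference rows only; every other row becomes NA.
  d[[paste0(vehicle, "_with_higher_X")]] <- ifelse(d$type == "Reference", counts, NA)
}

print(d[d$type == "Reference", ])
```

Looping over the type names keeps a single function definition, and here the columns are created in the order Train, Bus.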