Question

For each timestamp, I want to count the number of trains/buses whose coordinate is higher than the reference value. See the dataset below:

structure(list(timestamp = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), type = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
1L, 1L, 1L, 1L, 2L), .Label = c("Bus", "Reference", "Train"), class = "factor"), 
    X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L, 
    -56L, 489L, 412L, 350L, 278L, 970L, -65L, -894L, 780L, 265L, 
    -25L, 365L, 785L, 95L, 85L)), .Names = c("timestamp", "type", 
"X.coordinate"), row.names = c(NA, -22L), class = "data.frame")

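I want to add 2 columns. The first should give, on each "Reference" row, the number of trains whose coordinate is higher than the reference value. The second should give the number of buses whose coordinate is higher than the reference value.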

See the desired output below:

 timestamp      type X.coordinate Train_with_higher_X Bus_with_higher_X
1          1     Train          470                  NA                NA
2          1     Train         -300                  NA                NA
3          1     Train           25                  NA                NA
4          1     Train          456                  NA                NA
5          1     Train           37                  NA                NA
6          1       Bus           19                  NA                NA
7          1       Bus          798                  NA                NA
8          1       Bus          -56                  NA                NA
9          1       Bus          489                  NA                NA
10         1       Bus          412                  NA                NA
11         1 Reference          350                   2                 3
12         2     Train          278                  NA                NA
13         2     Train          970                  NA                NA
14         2     Train          -65                  NA                NA
15         2     Train         -894                  NA                NA
16         2     Train          780                  NA                NA
17         2       Bus          265                  NA                NA
18         2       Bus          -25                  NA                NA
19         2       Bus          365                  NA                NA
20         2       Bus          785                  NA                NA
21         2       Bus           95                  NA                NA
22         2 Reference           85                   3                 4

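I have tried to write some R code that does this in an elegant way, but my programming skills are too limited at the moment. Does anyone know a solution to this problem? Thanks in advance!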

Answer 1

This should do the trick (using data.table):

library(data.table)
setDT(DF)

DF[ , Train_with_higher_X :=
      sum(X.coordinate[type == "Train"] > 
            X.coordinate[type == "Reference"]), by = timestamp]

The same works for Bus_with_higher_X.
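As a minimal sketch of what "the same" looks like in practice, both columns can be filled in one grouped call using data.table's functional form of :=. The data frame is rebuilt here from the values in the question so that the block runs on its own; the name DF follows this answer.

```r
library(data.table)

# Same 22 rows as the dput() output in the question
DF <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), times = 2)),
  X.coordinate = c(470, -300, 25, 456, 37, 19, 798, -56, 489, 412, 350,
                   278, 970, -65, -894, 780, 265, -25, 365, 785, 95, 85)
)
setDT(DF)

# One grouped assignment per timestamp; each sum() is a scalar that
# data.table recycles to every row of the group
DF[, `:=`(
  Train_with_higher_X = sum(X.coordinate[type == "Train"] > X.coordinate[type == "Reference"]),
  Bus_with_higher_X   = sum(X.coordinate[type == "Bus"]   > X.coordinate[type == "Reference"])
), by = timestamp]
```

At this point every row of a timestamp carries that timestamp's counts, not only the Reference row.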

Keeping the value as NA for rows where type is not "Reference" is a bit more complicated, but if you are committed to it, fix it afterwards:

DF[type != "Reference", Train_with_higher_X := NA]

This approach also relies on there being exactly one row with type "Reference" for each value of timestamp.
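That assumption can be checked before running the grouped assignment. The sketch below is one way to do it in base R, again rebuilding the question's data under the name DF; the helper name ref_per_ts is only for illustration. A check like this is cheap insurance: if a timestamp has no Reference row, the comparison has length zero and sum() quietly returns 0, and with more than one Reference row the reference values are recycled across the rows being compared.

```r
# Same 22 rows as the dput() output in the question
DF <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), times = 2)),
  X.coordinate = c(470, -300, 25, 456, 37, 19, 798, -56, 489, 412, 350,
                   278, 970, -65, -894, 780, 265, -25, 365, 785, 95, 85)
)

# Number of Reference rows per timestamp (hypothetical helper name)
ref_per_ts <- tapply(DF$type == "Reference", DF$timestamp, sum)
print(ref_per_ts)

# Stop early if any timestamp has zero or several Reference rows
stopifnot(all(ref_per_ts == 1))
```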

Answer 2

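As an alternative to mixing summary data and raw data in a single data frame (and therefore ending up with lots of missing values in the summary columns), how about these options: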

library(dplyr)

Count the number of buses and trains greater than the reference value, by timestamp:

dat %>% group_by(timestamp) %>%
  mutate(Reference = X.coordinate[type=="Reference"]) %>%
  filter(type != "Reference") %>%
  group_by(timestamp, type) %>%
  summarise(Reference = unique(Reference), 
            numGTref = sum(X.coordinate > Reference))

  timestamp   type Reference numGTref
1         1    Bus       350        3
2         1  Train       350        2
3         2    Bus        85        4
4         2  Train        85        3
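If one row per timestamp is preferred, with column names matching the question, a variation on the same idea might look like the sketch below. It assumes, as the code above does, that the data frame is called dat and has exactly one Reference row per timestamp; the name res is only for illustration.

```r
library(dplyr)

# Same 22 rows as the dput() output in the question
dat <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), times = 2)),
  X.coordinate = c(470, -300, 25, 456, 37, 19, 798, -56, 489, 412, 350,
                   278, 970, -65, -894, 780, 265, -25, 365, 785, 95, 85)
)

# One summary row per timestamp; Reference is created first so the
# two counts can compare against it
res <- dat %>%
  group_by(timestamp) %>%
  summarise(Reference = X.coordinate[type == "Reference"],
            Train_with_higher_X = sum(X.coordinate[type == "Train"] > Reference),
            Bus_with_higher_X   = sum(X.coordinate[type == "Bus"] > Reference))
res
```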

Flag the buses and trains that are greater than the reference, by timestamp:

dat %>% group_by(timestamp) %>%
  mutate(Reference = X.coordinate[type=="Reference"]) %>%
  filter(type != "Reference") %>%
  group_by(timestamp, type) %>%
  mutate(Status = ifelse(X.coordinate > Reference, 
                         paste("Greater than", Reference), 
                         paste("Less than", Reference))) %>%
  select(-Reference)

   timestamp   type X.coordinate           Status
1          1  Train          470 Greater than 350
2          1  Train         -300    Less than 350
3          1  Train           25    Less than 350
4          1  Train          456 Greater than 350
5          1  Train           37    Less than 350
6          1    Bus           19    Less than 350
7          1    Bus          798 Greater than 350
8          1    Bus          -56    Less than 350
9          1    Bus          489 Greater than 350
10         1    Bus          412 Greater than 350
11         2  Train          278  Greater than 85
12         2  Train          970  Greater than 85
13         2  Train          -65     Less than 85
14         2  Train         -894     Less than 85
15         2  Train          780  Greater than 85
16         2    Bus          265  Greater than 85
17         2    Bus          -25     Less than 85
18         2    Bus          365  Greater than 85
19         2    Bus          785  Greater than 85
20         2    Bus           95  Greater than 85

Answer 3

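Getting familiar with packages like data.table and dplyr, as the other answers suggest, is definitely a worthwhile investment of your time. But if you want to solve this with more familiar tools, that is possible too. You can do it without any additional packages by combining a few base R functions, mapply, ifelse and with, into the following lines of code (I have named your data frame d):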

# Count the buses at timestamp x whose X.coordinate exceeds the reference value y
bus_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Bus",], sum(X.coordinate > y))
d$Bus_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)

# Same for trains
train_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Train",], sum(X.coordinate > y))
d$Train_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = train_sum, d$timestamp, d$X.coordinate), NA)

The result is almost exactly what you wanted, I believe.

   timestamp      type X.coordinate Train_with_higher_X Bus_with_higher_X
1          1     Train          470                  NA                NA
2          1     Train         -300                  NA                NA
3          1     Train           25                  NA                NA
4          1     Train          456                  NA                NA
5          1     Train           37                  NA                NA
6          1       Bus           19                  NA                NA
7          1       Bus          798                  NA                NA
8          1       Bus          -56                  NA                NA
9          1       Bus          489                  NA                NA
10         1       Bus          412                  NA                NA
11         1 Reference          350                   2                 3
12         2     Train          278                  NA                NA
13         2     Train          970                  NA                NA
14         2     Train          -65                  NA                NA
15         2     Train         -894                  NA                NA
16         2     Train          780                  NA                NA
17         2       Bus          265                  NA                NA
18         2       Bus          -25                  NA                NA
19         2       Bus          365                  NA                NA
20         2       Bus          785                  NA                NA
21         2       Bus           95                  NA                NA
22         2 Reference           85                   3                 4

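The functions bus_sum and train_sum take a timestamp and a reference coordinate and count the rows of each type whose X.coordinate is greater than the reference coordinate (using with on just the part of the data frame that matches that timestamp and type). The two functions differ only in d$type == "Bus" vs d$type == "Train", so it would be easy to generalize them with another argument to the function. Putting each function in mapply calls it with the timestamp and X.coordinate of each row. Wrapping the mapply in ifelse then simply sets all non-Reference rows to NA.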
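The generalization mentioned above could be sketched as follows. The function name count_higher and its kind argument are hypothetical, not part of the original answer, and the data frame d is rebuilt from the question's values so the block is self-contained. Unlike the ifelse version, this sketch calls the function only for the Reference rows.

```r
# Same 22 rows as the dput() output in the question
d <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), times = 2)),
  X.coordinate = c(470, -300, 25, 456, 37, 19, 798, -56, 489, 412, 350,
                   278, 970, -65, -894, 780, 265, -25, 365, 785, 95, 85)
)

# Count rows of the given kind at timestamp ts whose X.coordinate exceeds ref
# (hypothetical helper replacing bus_sum and train_sum)
count_higher <- function(ts, ref, kind) {
  sum(d$X.coordinate[d$timestamp == ts & d$type == kind] > ref)
}

is_ref <- d$type == "Reference"
for (kind in c("Train", "Bus")) {
  col <- paste0(kind, "_with_higher_X")
  d[[col]] <- NA_integer_                      # non-Reference rows stay NA
  d[[col]][is_ref] <- mapply(count_higher,
                             d$timestamp[is_ref],
                             d$X.coordinate[is_ref],
                             MoreArgs = list(kind = kind))
}
d[is_ref, ]
```

Adding another vehicle type would then only require extending the vector in the for loop.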

R - COUNTIFS function
