For each timestamp, I want to count the trains and buses whose coordinate is higher than the reference value. See the dataset below:
structure(list(timestamp = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), type = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 2L), .Label = c("Bus", "Reference", "Train"), class = "factor"),
X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L,
-56L, 489L, 412L, 350L, 278L, 970L, -65L, -894L, 780L, 265L,
-25L, 365L, 785L, 95L, 85L)), .Names = c("timestamp", "type",
"X.coordinate"), row.names = c(NA, -22L), class = "data.frame")
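I want to add two columns. The first should give, on each "Reference" row, the number of trains whose coordinate is higher than the reference value. The second should give the number of buses whose coordinate is higher than the reference value.
See the desired output below: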
timestamp type X.coordinate Train_with_higher_X Bus_with_higher_X
1 1 Train 470 NA NA
2 1 Train -300 NA NA
3 1 Train 25 NA NA
4 1 Train 456 NA NA
5 1 Train 37 NA NA
6 1 Bus 19 NA NA
7 1 Bus 798 NA NA
8 1 Bus -56 NA NA
9 1 Bus 489 NA NA
10 1 Bus 412 NA NA
11 1 Reference 350 2 3
12 2 Train 278 NA NA
13 2 Train 970 NA NA
14 2 Train -65 NA NA
15 2 Train -894 NA NA
16 2 Train 780 NA NA
17 2 Bus 265 NA NA
18 2 Bus -25 NA NA
19 2 Bus 365 NA NA
20 2 Bus 785 NA NA
21 2 Bus 95 NA NA
22 2 Reference 85 3 4
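I tried to write some R code to do this in an elegant way, but my programming skills are too limited for now. Does anyone know a solution to this problem? Thanks in advance!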
Answer 0: (score: 1)
This should do the trick (using data.table):
library(data.table)
setDT(DF)
DF[ , Train_with_higher_X :=
sum(X.coordinate[type == "Train"] >
X.coordinate[type == "Reference"]), by = timestamp]
The same approach works for Bus_with_higher_X.
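For completeness, the bus column could be built the same way. This is a minimal sketch, assuming the question's data is available as DF; the data.frame() call below is only a compact stand-in for the dput output above.

```r
library(data.table)

# Compact rebuild of the question's data (assumed equivalent to the dput output)
DF <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), 2)),
  X.coordinate = c(470, -300, 25, 456, 37, 19, 798, -56, 489, 412, 350,
                   278, 970, -65, -894, 780, 265, -25, 365, 785, 95, 85)
)
setDT(DF)

# Within each timestamp, count the buses whose coordinate exceeds the reference
DF[, Bus_with_higher_X :=
     sum(X.coordinate[type == "Bus"] >
           X.coordinate[type == "Reference"]), by = timestamp]
```

At this point every row of a timestamp carries the group's count; the NA masking for non-reference rows is a separate step.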
Keeping the value as NA for rows where type is not "Reference" is a bit more complicated, but if you are committed to it, fix it after the fact:
DF[type != "Reference", Train_with_higher_X := NA]
This approach also relies on there being exactly one row with type "Reference" within each value of timestamp.
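That assumption can be checked up front. The following is a small base R sketch, again assuming the data is available as DF (rebuilt here so the snippet stands alone); the name ref_per_timestamp is made up for illustration.

```r
# Compact rebuild of the question's data (assumed equivalent to the dput output)
DF <- data.frame(
  timestamp = rep(1:2, each = 11),
  type = factor(rep(c(rep("Train", 5), rep("Bus", 5), "Reference"), 2)),
  X.coordinate = c(470, -300, 25, 456, 37, 19, 798, -56, 489, 412, 350,
                   278, 970, -65, -894, 780, 265, -25, 365, 785, 95, 85)
)

# Number of "Reference" rows per timestamp (hypothetical helper variable)
ref_per_timestamp <- tapply(DF$type == "Reference", DF$timestamp, sum)

# Stop with an error if any timestamp has zero or several reference rows
stopifnot(all(ref_per_timestamp == 1))
```

If the check fails, the comparison against X.coordinate[type == "Reference"] would either recycle several reference values or compare against an empty vector, so the counts could not be trusted.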
Answer 1: (score: 1)
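As an alternative to mixing summary data and raw data in a single data frame (and therefore having lots of missing values in the summary columns), how about these options: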
library(dplyr)
Count the buses and trains with a coordinate greater than the reference value, by timestamp:
dat %>% group_by(timestamp) %>%
mutate(Reference = X.coordinate[type=="Reference"]) %>%
filter(type != "Reference") %>%
group_by(timestamp, type) %>%
summarise(Reference = unique(Reference),
numGTref = sum(X.coordinate > Reference))
timestamp type Reference numGTref
1 1 Bus 350 3
2 1 Train 350 2
3 2 Bus 85 4
4 2 Train 85 3
Flag the buses and trains with a coordinate greater than the reference value, by timestamp:
dat %>% group_by(timestamp) %>%
mutate(Reference = X.coordinate[type=="Reference"]) %>%
filter(type != "Reference") %>%
group_by(timestamp, type) %>%
mutate(Status = ifelse(X.coordinate > Reference,
paste("Greater than", Reference),
paste("Less than", Reference))) %>%
select(-Reference)
timestamp type X.coordinate Status
1 1 Train 470 Greater than 350
2 1 Train -300 Less than 350
3 1 Train 25 Less than 350
4 1 Train 456 Greater than 350
5 1 Train 37 Less than 350
6 1 Bus 19 Less than 350
7 1 Bus 798 Greater than 350
8 1 Bus -56 Less than 350
9 1 Bus 489 Greater than 350
10 1 Bus 412 Greater than 350
11 2 Train 278 Greater than 85
12 2 Train 970 Greater than 85
13 2 Train -65 Less than 85
14 2 Train -894 Less than 85
15 2 Train 780 Greater than 85
16 2 Bus 265 Greater than 85
17 2 Bus -25 Less than 85
18 2 Bus 365 Greater than 85
19 2 Bus 785 Greater than 85
20 2 Bus 95 Greater than 85
Answer 2: (score: 1)
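Getting familiar with packages such as data.table and dplyr, as the other answers suggest, is definitely a worthwhile investment of your time. But if you want to solve this with more familiar tools, that is possible too. You can do it without any extra packages by combining the base R functions mapply(), ifelse() and with() in the following lines of code (I have named your data frame d):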
bus_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Bus",], sum(X.coordinate > y))
d$Bus_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)
train_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Train",], sum(X.coordinate > y))
d$Train_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = train_sum, d$timestamp, d$X.coordinate), NA)
The result is pretty much what you wanted, I believe.
timestamp type X.coordinate Train_with_higher_X Bus_with_higher_X
1 1 Train 470 NA NA
2 1 Train -300 NA NA
3 1 Train 25 NA NA
4 1 Train 456 NA NA
5 1 Train 37 NA NA
6 1 Bus 19 NA NA
7 1 Bus 798 NA NA
8 1 Bus -56 NA NA
9 1 Bus 489 NA NA
10 1 Bus 412 NA NA
11 1 Reference 350 2 3
12 2 Train 278 NA NA
13 2 Train 970 NA NA
14 2 Train -65 NA NA
15 2 Train -894 NA NA
16 2 Train 780 NA NA
17 2 Bus 265 NA NA
18 2 Bus -25 NA NA
19 2 Bus 365 NA NA
20 2 Bus 785 NA NA
21 2 Bus 95 NA NA
22 2 Reference 85 3 4
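The functions bus_sum and train_sum take a timestamp and a reference coordinate and count the rows of each type whose X.coordinate is greater than the reference coordinate (using with() to restrict the comparison to the part of the data frame that matches that timestamp and type). The two functions differ only in d$type == "Bus" vs d$type == "Train", so it would be easy to generalize them with another argument to the function. Putting each function inside mapply() calls it with the timestamp and X.coordinate of each row. Wrapping the mapply() call in ifelse() then simply sets all the non-reference rows to NA.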