Question

I have a table containing flight IDs, arrivals and departures:

> test
   arrival departure flight_id
1                  9      2233
2                  8      1982
3        1                2164
4                  9      2081
5                         2130
6        2                2040
7        9                2030
8                         2130
9                  4      3169
10       6                2323
11                 8      2130
12                        2220
13                        3169
14                 9      2204
15       1                1910
16                 2       837
17                        1994
18       9         8      1994
19                        1994
20                        1994
21       9         1      2338
22       1         8      1981
23       9                2365
24                 8      2231
25       9                2048

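My goal is to count only the rows where both arrival and departure are empty, and then summarize by flight_id. But there is a catch! I don't think this can be done with table(), aggregate() or rle(), because they do not take interruptions into account.

For example, only consecutive rows of a flight ID where arrival == "" and departure == "" should be counted; if a row with non-empty values occurs for that flight ID, the count should start again from zero. Note: other flight IDs appearing in between do not matter. Each flight ID should be treated separately, which is why flight 2130 gets a count of 2 (rows 5 and 8).

In other words, the resulting output for test should look like this: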

output
  flight_id count
1      2130     2
2      2220     1
3      3169     1
4      1994     1
5      1994     2

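Note that flight ID 1994 occurs three times with blank arrival and departure, but there is a break at row 18. The flight ID therefore has to be counted twice.

I tried writing a for loop, but I get the error missing value where TRUE/FALSE needed: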

raw_data = test
unique_id = unique(raw_data$flight_id)

output<- data.frame("flight_id"= integer(0), "count" = integer(0), stringsAsFactors=FALSE)

for (flight_id in unique_id)
{
  oneflight <- raw_data[ which(raw_data$flight_id == flight_id), ]

  if(nrow(oneflight) >= 1 ){
    for(i in 2:nrow(oneflight)) {
      if(oneflight[i,"arrival"] == "" & oneflight[i,"departure"] == "") {
        new_row <- c(flight_id, sum(flight_id)[i])
        output[nrow(output) + 1,] = new_row
      }
    }
  }
}

How can I improve the code above, or can someone suggest a faster approach using dplyr? Here is a sample of the data:

> dput(test)
structure(list(arrival = c("", "", "1", "", "", "2", "9", "", 
"", "6", "", "", "", "", "1", "", "", "9", "", "", "9", "1", 
"9", "", "9"), departure = c("9", "8", "", "9", "", "", "", "", 
"4", "", "8", "", "", "9", "", "2", "", "8", "", "", "1", "8", 
"", "8", ""), flight_id = c(2233, 1982, 2164, 2081, 2130, 2040, 
2030, 2130, 3169, 2323, 2130, 2220, 3169, 2204, 1910, 837, 1994, 
1994, 1994, 1994, 2338, 1981, 2365, 2231, 2048)), .Names = c("arrival", 
"departure", "flight_id"), row.names = c(NA, 25L), class = "data.frame")

Answer 1

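If I understand your problem correctly, one trick you can use is to add an abstraction to flight_id that denotes a group.

For example, get a vector of indices

 i <- which(oneflight$arrival == "" & oneflight$departure == "")

Then take cumsum(1 - diff(i)) / 100 (or divide by a sufficiently large power of 10), add that to the flight ID, and you can then use table() to count the flights per group.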
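
A minimal base R sketch of this grouping idea, under some assumptions: instead of the cumsum(1 - diff(i)) expression, it uses a per-flight running count of non-blank rows as the group marker, and it keeps that marker in a separate column (called run here, a name chosen only for illustration) rather than adding a fraction to flight_id. The data frame is rebuilt from the dput() output in the question.

```r
# Rebuild the sample data from the dput() output in the question
test <- data.frame(
  arrival = c("", "", "1", "", "", "2", "9", "", "", "6", "", "", "", "",
              "1", "", "", "9", "", "", "9", "1", "9", "", "9"),
  departure = c("9", "8", "", "9", "", "", "", "", "4", "", "8", "", "", "9",
                "", "2", "", "8", "", "", "1", "8", "", "8", ""),
  flight_id = c(2233, 1982, 2164, 2081, 2130, 2040, 2030, 2130, 3169, 2323,
                2130, 2220, 3169, 2204, 1910, 837, 1994, 1994, 1994, 1994,
                2338, 1981, 2365, 2231, 2048),
  stringsAsFactors = FALSE
)

# TRUE for rows where both arrival and departure are blank
blank <- test$arrival == "" & test$departure == ""

# Within each flight_id, count the non-blank rows seen so far.
# Blank rows of one flight that share this value have no non-blank
# row of the same flight between them, so they belong to one group.
run <- ave(as.integer(!blank), test$flight_id, FUN = cumsum)

# Count the blank rows per (flight_id, run) pair
counts <- aggregate(
  data.frame(count = rep(1L, sum(blank))),
  by = list(flight_id = test$flight_id[blank], run = run[blank]),
  FUN = sum
)
print(counts)
```

With the sample data this yields five (flight_id, run) groups, and the run column can be dropped afterwards. The rows come out in aggregate()'s sort order of the grouping columns, not in order of first appearance.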

Answer 2

Here is a solution using data.table:

library(data.table)
flights <- test$flight_id[test$arrival=="" & test$departure==""]

setDT(test)[flight_id %in% flights, grp := rleid(arrival=="",departure=="")][
    arrival=="" & departure=="",.(count = .N),.(flight_id, grp)]
#   flight_id grp count
#1:      2130   1     2
#2:      2220   3     1
#3:      3169   3     1
#4:      1994   3     1
#5:      1994   5     2

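Explanation

First, we get the flight_ids that have at least one record with empty arrival and departure values. Then we subset your data using this vector flights and generate a run-length ID column, called "grp", based on arrival=="" and departure=="". Finally, we generate the count of records (i.e. .N) where arrival=="" & departure=="", grouped by the columns flight_id and grp.

You can drop the grp column if you wish.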
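
To illustrate what a run-length ID is, here is a small base R sketch that reproduces the grp values for this data without data.table. The helper name run_id is hypothetical, and the two logical vectors are typed in by hand from the ten rows of test that survive the flight_id %in% flights subset (rows 5, 8, 9, 11, 12, 13, 17, 18, 19 and 20).

```r
# Hypothetical helper: number each run of identical consecutive values,
# built on base R's rle()
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Blank flags for the ten rows kept by the subset (entered by hand)
arrival_blank   <- c(TRUE, TRUE, TRUE,  TRUE,  TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
departure_blank <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Combine both flags into one key so that a change in either starts a new run
grp <- run_id(paste(arrival_blank, departure_blank))
print(grp)
# 1 1 2 2 3 3 3 4 5 5
```

The runs where both flags are TRUE carry the ids 1, 3 and 5, which are the grp values shown in the output above.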

Summing consecutive strings across multiple columns
