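I have a table describing train tracks, where each row is one part of a track with a `from` and a `to` station as well as a `trackID` and a `segment` ID. The real station names are completely random and not structured as shown here.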
tracks <- data.frame(
trackID = c(rep("A",4),rep("B",4)),
segment = letters[1:8],
from = paste0("station_1",1:8),
to = paste0("station_2",1:8)
)
tracks
trackID segment from to
1 A a station_11 station_21
2 A b station_12 station_22
3 A c station_13 station_23
4 A d station_14 station_24
5 B e station_15 station_25
6 B f station_16 station_26
7 B g station_17 station_27
8 B h station_18 station_28
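I have another table with sightings on these train tracks, and I want to know which `trackID` each sighting corresponds to. That table looks like this: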
sightings <- data.frame(from = c("station_24","station_28","station_14"),
to = c("station_14","station_16","station_25"))
sightings
from to
1 station_24 station_14
2 station_28 station_16
3 station_14 station_25
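I can work out the `trackID` from the `from` and `to` information provided in the sightings table. However, `from` and `to` in the `sightings` table do not correspond to `from` and `to` in the `tracks` table: `from` and `to` can lie in different segments, and they can be interchanged (`to`-`from`). In some faulty cases, `from` and `to` lie on different `trackID`s, and then no match should be returned. The desired output for the example is:
from to trackID
1 station_24 station_14 A
2 station_28 station_16 B
3 station_14 station_25 <NA> # no match since station_14 and 25 are from two different trackIDs
It seems to me that the solution involves collapsing the `tracks` table by `trackID` and then doing a two-part partial match on the strings (using `grepl()`?). The next few lines take care of the collapsing, but I don't know where to go from there. Can someone point me in the right direction?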
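I would very much prefer a solution using R / dplyr, but I'll take anything!
library(dplyr)
tracks %>%
group_by(trackID) %>%
summarise(
from_to = paste(paste(from,collapse = ","),paste(to,collapse = ","),sep = ",")
)
trackID from_to
<fct> <chr>
1 A station_11,station_12,station_13,station_14,station_21,station_22,station_23,station_24
2 B station_15,station_16,station_17,station_18,station_25,station_26,station_27,station_28
EDIT: It seems I oversimplified my problem in the minimal example. The main issue is that the stations (`from` and `to`) are not unique in the table, not even unique to a `trackID`. Only the combination of `from` and `to` is unique to a `trackID`. I have accepted the answer because it solves the problem as stated above, but I will also provide my own solution.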
Answer 0 (score: 1)
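A two-way join can work.
(Note: you don't seem to use `segment`, so I drop it here, but this can be adapted if you need it. I also added `stringsAsFactors=FALSE` to your data, because combining `factor` vectors can otherwise be problematic.)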
library(dplyr)
tracksmod <- bind_rows(
select(tracks, trackID, sta=from),
select(tracks, trackID, sta=to)
)
head(tracksmod)
# trackID sta
# 1 A station_11
# 2 A station_12
# 3 A station_13
# 4 A station_14
# 5 B station_15
# 6 B station_16
sightings %>%
left_join(select(tracksmod, trackID, from=sta), by="from") %>%
left_join(select(tracksmod, trackID2=trackID, to=sta), by="to") %>%
mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
select(-trackID2)
# from to trackID
# 1 station_24 station_14 A
# 2 station_28 station_16 B
# 3 station_14 station_25 <NA>
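I'm assuming that direction does not matter. That is, I don't assume that a station listed under `from` in `sightings` must always be in the `from` column of `tracks`. That is why I converted `tracks` into `tracksmod`: it lets me identify the ID of a station regardless of direction.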
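As a complement to the join above, here is a minimal base-R sketch of the same direction-independent idea expressed as set membership. It is not the method used in this answer. It builds one station set per `trackID` and keeps the IDs of every track whose set contains both stations of a sighting. The helper name `match_track` and the list `stations_by_track` are made up for illustration, and the data is copied from the question with `stringsAsFactors = FALSE` added.

```r
# Example data copied from the question
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", 1:8),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_16", "station_25"),
  stringsAsFactors = FALSE
)

# One character vector of stations per trackID, ignoring direction
stations_by_track <- lapply(
  split(tracks, tracks$trackID),
  function(d) unique(c(d$from, d$to))
)

# Hypothetical helper: IDs of all tracks that contain both stations,
# comma-separated, or NA if no single track contains both
match_track <- function(from, to) {
  hit <- vapply(stations_by_track,
                function(s) from %in% s && to %in% s,
                logical(1))
  if (any(hit)) paste(names(hit)[hit], collapse = ",") else NA_character_
}

sightings$trackID <- mapply(match_track, sightings$from, sightings$to,
                            USE.NAMES = FALSE)
sightings
```

Because each sighting is tested against whole station sets, a station that occurs on several tracks does not multiply rows the way a join on a single station column can.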
Answer 1 (score: 0)
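As I noted in the EDIT to my question, I oversimplified my problem in the minimal example.
Here is an updated version of the data that resembles my real data more closely. As @r2evans suggested, I also added `stringsAsFactors = F`.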
tracks <- data.frame(
trackID = c(rep("A",4),rep("B",4)),
segment = letters[1:8],
from = paste0("station_1",c(1:4,1,2,5,6)),
to = paste0("station_2",1:8),
stringsAsFactors = F
)
sightings <- data.frame(
from = c("station_24","station_28","station_14"),
to = c("station_14","station_11","station_25"),
trackID = c("A","B",NA),
stringsAsFactors = F
)
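A quick check, sketched in base R, that the updated data really contains `from` stations shared between tracks. The variable names `pairs` and `shared` are only illustrative, and the `tracks` data is repeated so the snippet runs on its own.

```r
# Updated example data, repeated here so the snippet is self-contained
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)

# Distinct (trackID, from) combinations
pairs <- unique(tracks[, c("trackID", "from")])

# A station that appears more than once here belongs to more than one trackID
shared <- sort(unique(pairs$from[duplicated(pairs$from)]))
shared
```

For this data the check should flag `station_11` and `station_12`, which occur as `from` stations on both track A and track B.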
I solved this by collapsing the `tracks` table by `trackID` and then using the mapping functions from the `purrr` package in a nested fashion.
library(dplyr)
# Collapsing the tracks-dataframe
tracks_collapse <- tracks %>%
group_by(trackID) %>%
summarise(
from_to = paste(paste(from,collapse = ","),paste(to,collapse = ","),sep = ",")
# from = list(from),
# to = list(to),
# stas = list(c(from,to))
)
# a helper function to remove NAs when looking for matches
remove_na <- function(x){x[!is.na(x)]}
library(purrr)
pmap_dfr(sightings, function(from, to, trackID){ # iterates over the rows of sightings and row-binds the results
  matches <- remove_na( # drops the NA values (tracks that did not match)
    pmap_chr( # compares the current sighting with every row of tracks_collapse
      tracks_collapse,
      function(trackID, from_to){
        # literal substring matching; returns the trackID only if both stations occur in from_to
        ifelse(grepl(from, from_to, fixed = TRUE) & grepl(to, from_to, fixed = TRUE), trackID, NA_character_)
      }
    )
  )
  data.frame(
    from = from, # recreates the sightings data.frame
    to = to, # ditto
    trackID = if (length(matches) > 0) paste(matches, collapse = ",") else NA_character_, # NA when no track matches
    stringsAsFactors = FALSE
  )
})
Output:
from to trackID
1 station_24 station_14 A
2 station_28 station_11 B
3 station_14 station_25 <NA>
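One caveat with matching on the collapsed string: `grepl()` does substring matching, so a station whose name is a prefix of another station's name would also match. With truly random station names this may never happen, but the small sketch below shows the difference between substring matching and exact matching on the split string. The string `from_to` and the query `"station_1"` are invented for illustration.

```r
# A collapsed station string in the same shape as tracks_collapse$from_to
from_to <- "station_11,station_12,station_21,station_22"

# Substring matching: "station_1" is not a station, yet it matches
partial_hit <- grepl("station_1", from_to, fixed = TRUE)

# Exact matching: split the string back into station names first
stations <- strsplit(from_to, ",", fixed = TRUE)[[1]]
exact_hit <- "station_1" %in% stations
real_hit <- "station_11" %in% stations

c(partial = partial_hit, exact = exact_hit, real = real_hit)
```

If this is a concern, keeping the stations as a list column (as in the commented-out `stas = list(c(from,to))` line above) and testing with `%in%` avoids the issue.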