Join tables by multiple partial matches

Time: 2018-07-17 22:51:37

Tags: r join dplyr

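I have a table describing train tracks. Each row is one segment of a track, with a from station and a to station as well as a trackID and a segment ID. The real station names are completely random, not structured as shown here.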

tracks <- data.frame(
  trackID = c(rep("A",4),rep("B",4)),
  segment = letters[1:8],
  from = paste0("station_1",1:8),
  to = paste0("station_2",1:8)
  )

tracks 

  trackID segment       from         to
1       A       a station_11 station_21
2       A       b station_12 station_22
3       A       c station_13 station_23
4       A       d station_14 station_24
5       B       e station_15 station_25
6       B       f station_16 station_26
7       B       g station_17 station_27
8       B       h station_18 station_28

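I have another table with sightings on these train tracks, and I would like to know which trackID each sighting corresponds to. That table looks like this: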

sightings <- data.frame(from = c("station_24","station_28","station_14"),
                    to = c("station_14","station_16","station_25"))

sightings 

        from         to
1 station_24 station_14
2 station_28 station_16
3 station_14 station_25

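The trackID can be worked out from the from and to information in the sightings table. However, from and to in the sightings table do not correspond to from and to in the tracks table: the two stations can lie in different segments, and they can be interchanged (from-to may appear as to-from). In some faulty cases, from and to lie on different trackIDs, and then no match should be returned. The desired output for the example is:

        from         to trackID
1 station_24 station_14       A
2 station_28 station_16       B
3 station_14 station_25    <NA> # no match since station_14 and 25 are from two different trackIDs

It seems to me that a solution would involve collapsing the tracks table by trackID and then doing a double partial match on the strings (using grepl()?). The next few lines take care of the collapsing, but I don't know where to go from there. Can anyone point me in the right direction?

I would much prefer a solution using dplyr / R, but I'll take anything!

library(dplyr)

tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from,collapse = ","),paste(to,collapse = ","),sep = ",")
  )

  trackID from_to
  <fct>   <chr>
1 A       station_11,station_12,station_13,station_14,station_21,station_22,station_23,station_24
2 B       station_15,station_16,station_17,station_18,station_25,station_26,station_27,station_28

EDIT: It seems I oversimplified my problem in the minimal example. The main issue is that the stations (from, to) are not unique in the table, and not even unique to one trackID. Only the from/to combination is unique to a trackID. I have accepted the answer because it solves the problem as stated above, but I will also provide my own solution.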

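2 Answers:

Answer 0: (score: 1)

A double join should work.

(Note: you don't seem to be using segment, so I dropped it here, but this can be modified if you need it. I also added stringsAsFactors=FALSE to your data, because combining vectors of type factor can otherwise be problematic.)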

library(dplyr)

tracksmod <- bind_rows(
  select(tracks, trackID, sta=from),
  select(tracks, trackID, sta=to)
)
head(tracksmod)
#   trackID        sta
# 1       A station_11
# 2       A station_12
# 3       A station_13
# 4       A station_14
# 5       B station_15
# 6       B station_16

sightings %>%
  left_join(select(tracksmod, trackID, from=sta), by="from") %>%
  left_join(select(tracksmod, trackID2=trackID, to=sta), by="to") %>%
  mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
  select(-trackID2)
#         from         to trackID
# 1 station_24 station_14       A
# 2 station_28 station_16       B
# 3 station_14 station_25    <NA>

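I am assuming that direction does not matter. That is, I am not assuming that a station listed in from must always appear in the from column. That is why I converted tracks to tracksmod: it identifies each station with a trackID regardless of direction.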
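When a station can occur on more than one track (the situation described in the question's EDIT), comparing a single trackID per station is no longer enough. One possible variation on the join above, sketched here under the assumption that the rows of sightings are distinct, keeps every (trackID, station) pair and retains only the tracks on which both stations of a sighting occur. The names both_ends and result are illustrative and not part of the original answer.

```r
library(dplyr)

# Updated example data from the question's EDIT: station_11 and station_12
# occur on both track A and track B.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  stringsAsFactors = FALSE
)

# One row per distinct (trackID, station) pair, ignoring direction.
tracksmod <- bind_rows(
  select(tracks, trackID, sta = from),
  select(tracks, trackID, sta = to)
) %>%
  distinct()

# Candidate tracks for each end of a sighting; keep only the tracks
# that contain both ends.
both_ends <- sightings %>%
  inner_join(select(tracksmod, trackID, from = sta), by = "from") %>%
  inner_join(select(tracksmod, trackID2 = trackID, to = sta), by = "to") %>%
  filter(trackID == trackID2) %>%
  select(-trackID2)

# Re-attach to sightings so that sightings without a common track get NA.
result <- left_join(sightings, both_ends, by = c("from", "to"))
result
# trackID column: "A", "B", NA
```

If a pair of stations occurs together on several tracks, this sketch returns one row per matching track rather than a single collapsed value.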

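Answer 1: (score: 0)

As I mentioned in the EDIT to my question, I oversimplified my problem in the minimal example. Here is an updated version of the data that reflects my real data more accurately. As @r2evans suggested, I also added stringsAsFactors = F.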

tracks <- data.frame(
  trackID = c(rep("A",4),rep("B",4)),
  segment = letters[1:8],
  from = paste0("station_1",c(1:4,1,2,5,6)),
  to = paste0("station_2",1:8),
  stringsAsFactors = F
  )

sightings <- data.frame(
  from = c("station_24","station_28","station_14"),
  to = c("station_14","station_11","station_25"),
  trackID = c("A","B",NA),
  stringsAsFactors = F
)

I solved this by collapsing the tracks table by trackID and then using the looping functions from the purrr package in a nested way.

library(dplyr)

# Collapsing the tracks-dataframe
tracks_collapse <- tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from,collapse = ","),paste(to,collapse = ","),sep = ",")
    # from = list(from),
    # to = list(to),
    # stas = list(c(from,to))
    )

# a helper function to remove NAs when looking for matches
remove_na <- function(x){x[!is.na(x)]}

library(purrr)


pmap_dfr(sightings, function(from,to,trackID){                         # pmap_dfr runs over a data.frame and returns a data.frame
  data.frame(
    from = from,                                                       # recreates the sightings data.frame
    to = to,                                                           # ditto
    trackID = na_if(paste(                                             # collapses the resulting vector; "" (no match) becomes NA
      remove_na(                                                       # removes the NA values
        pmap_chr(                                                      # matches every row from the sightings-data.frame with the tracks-data.frame
          tracks_collapse,
          function(trackID,from_to){
            ifelse(grepl(from,from_to) & grepl(to,from_to),trackID,NA_character_) # does partial string matching and returns the trackID if both strings match
            }
          )
        ),collapse = ","
      ), ""),
    stringsAsFactors = F
    )
  })

Output:

        from         to trackID
1 station_24 station_14       A
2 station_28 station_11       B
3 station_14 station_25    <NA>
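One caveat with grepl() on a collapsed string is that it matches substrings, so a station called station_1 would also match station_11. A possible alternative, sketched here in base R and roughly along the lines of the commented-out stas = list(c(from,to)) idea, keeps the stations of each track as a vector and tests exact membership with %in%. The names stations_by_track and find_track are hypothetical helpers introduced only for this illustration.

```r
# Updated example data from this answer.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  stringsAsFactors = FALSE
)

# Substring matching vs. exact membership:
grepl("station_1", "station_11,station_12")      # TRUE  (partial match)
"station_1" %in% c("station_11", "station_12")   # FALSE (exact match)

# Named list: trackID -> all stations on that track, regardless of direction.
stations_by_track <- lapply(
  split(tracks, tracks$trackID),
  function(d) unique(c(d$from, d$to))
)

# Hypothetical helper: the trackIDs that contain both stations,
# collapsed with ",", or NA if there is none.
find_track <- function(from, to) {
  hit <- vapply(stations_by_track,
                function(s) from %in% s && to %in% s,
                logical(1))
  if (any(hit)) paste(names(hit)[hit], collapse = ",") else NA_character_
}

sightings$trackID <- unname(mapply(find_track, sightings$from, sightings$to))
sightings
# trackID column: "A", "B", NA
```

Exact membership also sidesteps regular-expression metacharacters in station names, which grepl() would otherwise interpret unless fixed = TRUE is set.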