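How can I use R to extract occurrences of one vector of strings within another vector of strings?

Asked: 2018-08-07 19:10:57

Tags: r regex stringr stringi

I have a vector of strings like this: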

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"))

I also have a vector of fruits:

fruits <- tibble(fruit =c("apple", "orange", "plum", "pear"))

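What I want is a data.frame / tibble that keeps the original strings data.frame and adds a second list or character column holding all of the fruits found in the original column. Something like this: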

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"),
                   match = c("apple, orange, plum",
                             "plum, pear",
                             "pear")
                  )

I tried str_extract(strings, fruits) and got a list in which everything is blank, along with this warning:

Warning message:
In stri_detect_regex(string, pattern, opts_regex = opts(pattern)):
longer object length is not a multiple of shorter object length

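I also tried str_extract_all(strings, paste0(fruits, collapse = "|")) and got the same message.

I have looked at Find matches of a vector of strings in another vector of strings, but it does not seem to help.

Any help would be greatly appreciated.

3 Answers:

Answer 0 (score: 2)

Here is one option. First, we split each row of the string column into separate strings (as it stands, "apple, orange, plum, tomato" is a single string). Then we compare each resulting vector of strings with the contents of the fruits$fruit column and store the matching values in a new list column called fruits.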

library("tidyverse")
strings <- tibble(
  string = c(
    "apple, orange, plum, tomato",
    "plum, beat, pear, cactus",
    "centipede, toothpick, pear, fruit"
  )
)

fruits <- tibble(fruit =c("apple", "orange", "plum", "pear"))

strings %>%
  mutate(str2 = str_split(string, ", ")) %>%
  rowwise() %>%
  mutate(fruits = list(intersect(str2, fruits$fruit)))
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 3
#>   string                            str2      fruits   
#>   <chr>                             <list>    <list>   
#> 1 apple, orange, plum, tomato       <chr [4]> <chr [3]>
#> 2 plum, beat, pear, cactus          <chr [4]> <chr [2]>
#> 3 centipede, toothpick, pear, fruit <chr [4]> <chr [1]>

Created on 2018-08-07 by the reprex package (v0.2.0).
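
The result above is a list column, while the question also allows a plain character column. As a minimal sketch of one way to get there (a variation on this answer, not part of it), the split pieces can be intersected and pasted back together in a single step. The helper column name parts and the object name result are placeholders chosen for illustration.

```r
library(dplyr)
library(stringr)
library(purrr)

strings <- tibble(
  string = c(
    "apple, orange, plum, tomato",
    "plum, beat, pear, cactus",
    "centipede, toothpick, pear, fruit"
  )
)
fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

result <- strings %>%
  mutate(
    # "parts" is a temporary list column: one character vector per row
    parts = str_split(string, ", "),
    # keep only the pieces that are known fruits, then rejoin them
    match = map_chr(parts, ~ paste(intersect(.x, fruits$fruit), collapse = ", "))
  ) %>%
  select(-parts)

result$match
#> [1] "apple, orange, plum" "plum, pear"          "pear"
```

Because paste() returns an empty string for empty input, a row with no fruit at all should end up with "" rather than an error.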

Answer 1 (score: 2)

Here is an example using purrr:

library(tidyverse)

strings <- tibble(string = c("apple, orange, plum, tomato",
                         "plum, beat, pear, cactus",
                         "centipede, toothpick, pear, fruit"))

fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

# Return every pattern that occurs in string_to_parse (NULL if none match)
extract_if_exists <- function(string_to_parse, pattern){
  # One list element per pattern; NA marks a pattern that did not match
  extraction <- stringi::stri_extract_all_regex(string_to_parse, pattern)
  extraction <- unlist(extraction[!(is.na(extraction))])
  return(extraction)
}

strings %>%
  mutate(matches = map(string, extract_if_exists, fruits$fruit)) %>%
  # Collapse the matches (not the original string) into one comma-separated value
  mutate(matches = map(matches, str_c, collapse = ", ")) %>%
  unnest(matches)

Answer 2 (score: 1)

Here is a base R solution:

strings[["match"]] <- 
  sapply(
    strsplit(strings[["string"]], ", "), 
    function(x) {
      paste(x[x %in% fruits[["fruit"]]], collapse = ", ")
    }
  )

Result:

  string                            match              
  <chr>                             <chr>              
1 apple, orange, plum, tomato       apple, orange, plum
2 plum, beat, pear, cactus          plum, pear         
3 centipede, toothpick, pear, fruit pear
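
For comparison, the regex approach attempted in the question appears to work once character vectors, rather than the tibbles themselves, are passed to stringr. Passing the whole tibble is the likely cause of the warning, though that is an inference and not something confirmed in the thread. The sketch below uses plain vectors with illustrative names (strings_vec, fruit_vec, matched) and wraps the alternation in word boundaries, an added assumption so that a fruit name is not picked up inside a longer word.

```r
library(stringr)

strings_vec <- c(
  "apple, orange, plum, tomato",
  "plum, beat, pear, cactus",
  "centipede, toothpick, pear, fruit"
)
fruit_vec <- c("apple", "orange", "plum", "pear")

# Build one alternation pattern: \b(apple|orange|plum|pear)\b
pattern <- paste0("\\b(", paste(fruit_vec, collapse = "|"), ")\\b")

# str_extract_all() returns a list with one character vector per input string
found <- str_extract_all(strings_vec, pattern)

# Collapse each vector of matches into a single comma-separated string
matched <- vapply(found, paste, character(1), collapse = ", ")

matched
#> [1] "apple, orange, plum" "plum, pear"          "pear"
```

Unlike the split-and-intersect answers, this version keeps repeated matches, so a fruit that appears twice in one string would be listed twice.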