Question

I have a vector of strings like this:

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"))

I have a vector of fruits:

fruits <- tibble(fruit =c("apple", "orange", "plum", "pear"))

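What I want is a data.frame / tibble that keeps the original strings data.frame and adds a second list or character column containing all of the fruits found in the original column. Something like this: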

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"),
                   match = c("apple, orange, plum",
                             "plum, pear",
                             "pear")
                  )

I tried str_extract(strings, fruits) and got a list in which everything was blank, along with this warning:

Warning message:
In stri_detect_regex(string, pattern, opts_regex = opts(pattern)):
longer object length is not a multiple of shorter object length

I also tried str_extract_all(strings, paste0(fruits, collapse = "|")) and got the same message.

I have looked at "Find matches of a vector of strings in another vector of strings", but it does not seem to help.

Any help would be greatly appreciated.

Answer 1

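Here is one option. First, we split each row of the string column into separate strings (at the moment "apple, orange, plum, tomato" is all one string). Then we compare each resulting vector of strings with the contents of the fruits$fruit column and store the matching values as a list in a new fruits column.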

library("tidyverse")
strings <- tibble(
  string = c(
    "apple, orange, plum, tomato",
    "plum, beat, pear, cactus",
    "centipede, toothpick, pear, fruit"
  )
)

fruits <- tibble(fruit =c("apple", "orange", "plum", "pear"))

strings %>%
  mutate(str2 = str_split(string, ", ")) %>%
  rowwise() %>%
  mutate(fruits = list(intersect(str2, fruits$fruit)))
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 3
#>   string                            str2      fruits   
#>   <chr>                             <list>    <list>   
#> 1 apple, orange, plum, tomato       <chr [4]> <chr [3]>
#> 2 plum, beat, pear, cactus          <chr [4]> <chr [2]>
#> 3 centipede, toothpick, pear, fruit <chr [4]> <chr [1]>

Created on 2018-08-07 by the reprex package (v0.2.0).
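
The pipeline above leaves the matches in a list column. If a plain character column like the match column in the question is preferred, one possible variation is sketched below. It is only a sketch: the names parts and result are made up for illustration, and it assumes the items in each string are always separated by exactly ", ".

```r
library(dplyr)
library(purrr)
library(stringr)

strings <- tibble(
  string = c(
    "apple, orange, plum, tomato",
    "plum, beat, pear, cactus",
    "centipede, toothpick, pear, fruit"
  )
)

fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

result <- strings %>%
  mutate(
    # Split every row into its individual items (hypothetical helper column).
    parts = str_split(string, ", "),
    # Keep only the items that are known fruits and glue them back together.
    match = map_chr(parts, ~ paste(intersect(.x, fruits$fruit), collapse = ", "))
  ) %>%
  select(-parts)

result
```

Because map_chr() works element by element on the list column, rowwise() is not needed in this variation. Note that intersect() also drops duplicates, so a fruit that appears twice in one string is listed once.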

Answer 2

Here is an example using purrr:

library(tidyverse)

strings <- tibble(string = c("apple, orange, plum, tomato",
                         "plum, beat, pear, cactus",
                         "centipede, toothpick, pear, fruit"))

fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

# Try every pattern against one string and keep only the ones that matched.
extract_if_exists <- function(string_to_parse, pattern){
  extraction <- stringi::stri_extract_all_regex(string_to_parse, pattern)
  # Patterns without a match come back as NA; drop them before flattening.
  extraction <- unlist(extraction[!(is.na(extraction))])
  return(extraction)
}

strings %>%
  mutate(matches = map(string, extract_if_exists, fruits$fruit)) %>%
  # Collapse the matches (not the original string) into one value per row.
  mutate(matches = map_chr(matches, str_c, collapse = ", "))
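
As a side note on the attempts in the question: both calls pass the whole tibbles (strings and fruits) where stringr expects character vectors, which is the likely reason for the blank results and the length warning. A minimal sketch that passes the columns instead and builds one alternation pattern might look like this. The word-boundary wrapper and the names pattern and result are additions for illustration, and the sketch assumes the fruit names contain no regex metacharacters.

```r
library(dplyr)
library(purrr)
library(stringr)

strings <- tibble(string = c("apple, orange, plum, tomato",
                             "plum, beat, pear, cactus",
                             "centipede, toothpick, pear, fruit"))

fruits <- tibble(fruit = c("apple", "orange", "plum", "pear"))

# One regex that matches any fruit as a whole word, e.g. \b(apple|orange|...)\b.
# Assumption: fruit names are plain words without regex special characters.
pattern <- str_c("\\b(", str_c(fruits$fruit, collapse = "|"), ")\\b")

result <- strings %>%
  # str_extract_all() returns one character vector of matches per row;
  # paste() collapses each vector into a single comma-separated string.
  mutate(match = map_chr(str_extract_all(string, pattern), paste, collapse = ", "))

result
```

With this approach the matches are listed in the order they appear in each string, and the word boundaries keep a pattern such as "pear" from matching inside a longer word.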

Answer 3

Here is a base-R solution:

strings[["match"]] <- 
  sapply(
    strsplit(strings[["string"]], ", "), 
    function(x) {
      paste(x[x %in% fruits[["fruit"]]], collapse = ", ")
    }
  )

Result:

  string                            match              
  <chr>                             <chr>              
1 apple, orange, plum, tomato       apple, orange, plum
2 plum, beat, pear, cactus          plum, pear         
3 centipede, toothpick, pear, fruit pear

How to extract occurrences of a vector of strings in another vector of strings using R?
