Question

I need to find specific words in R even when they contain spaces in the text.

For example, I have a list of restaurants like this:

r_list <- c('mexicana', 'macdonald', 'KFC')

And I have a list of sentences that talk about these restaurants, for example:

sentense <- c('I really like mexi cana', 'want to eat mac donaldso much!', 'I hateKF C')

Eventually I want to use a for loop to increment a rank for each restaurant.

grep('mexicana', sentense)

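When I grep for mexicana, nothing matches.

So I wanted to use a trie algorithm, but the 'triebeard' package does not work when I use it with Korean text.

I hope you can help me. What should I do?

Is the 'gsub' function the only option?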

Answer 1

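You could try the following:

The idea is to remove all whitespace from sentense, upper-case both sentense and r_list (to make matching easier), and then match the restaurant names against the result.

Sample data: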

r_list <- c('mexicana', 'macdonald', 'KFC')

sentense <- c('I really like mexi cana', 'want to eat mac donaldso much!', 'I hateKF C')

Solution:

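library(tidyverse)

tibble(value = sentense) %>%
  # strip all whitespace and upper-case so spacing and case no longer matter
  mutate(concatenate = toupper(gsub("[[:space:]]", "", value)),
         # pull out the first restaurant name found in each sentence
         eating = str_extract(concatenate,
                              paste(toupper(r_list), collapse = "|")),
         # map the upper-cased match back to its spelling in r_list
         eating = r_list[match(eating, toupper(r_list))])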

Output:

# A tibble: 3 x 3
  value                          concatenate               eating   
  <chr>                          <chr>                     <chr>    
1 I really like mexi cana        IREALLYLIKEMEXICANA       mexicana 
2 want to eat mac donaldso much! WANTTOEATMACDONALDSOMUCH! macdonald
3 I hateKF C                     IHATEKFC                  KFC
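
The question also asks about incrementing a rank per restaurant in a for loop. A minimal base R sketch of that step, using the same normalization idea, might look like the following. The variable names clean and mentions are made up for illustration, and fixed = TRUE assumes the restaurant names should be matched literally rather than as regular expressions.

```r
r_list <- c('mexicana', 'macdonald', 'KFC')
sentense <- c('I really like mexi cana', 'want to eat mac donaldso much!', 'I hateKF C')

# Normalize the sentences: drop whitespace and upper-case
clean <- toupper(gsub("[[:space:]]", "", sentense))

# One counter per restaurant, all starting at zero (hypothetical name)
mentions <- setNames(integer(length(r_list)), r_list)

for (r in r_list) {
  # grepl() returns one TRUE/FALSE per sentence; sum() counts the TRUEs
  mentions[r] <- mentions[r] + sum(grepl(toupper(r), clean, fixed = TRUE))
}

mentions
#  mexicana macdonald       KFC
#         1         1         1
```

This counts each sentence at most once per restaurant, which may or may not be the ranking rule you want.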

Answer 2

Since you want to extract with a regular expression, you can use gregexpr and regmatches.

( nospaces <- gsub("\\s", "", sentense) )
# [1] "Ireallylikemexicana"       "wanttoeatmacdonaldsomuch!" "IhateKFC"                 

re <- gregexpr(paste(r_list, collapse = "|"), nospaces)
regmatches(nospaces, re)
# [[1]]
# [1] "mexicana"
# [[2]]
# [1] "macdonald"
# [[3]]
# [1] "KFC"

The return value of gregexpr is a list whose elements carry the following attributes:

str(re)
# List of 3
#  $ : int 12
#   ..- attr(*, "match.length")= int 8
#   ..- attr(*, "index.type")= chr "chars"
#   ..- attr(*, "useBytes")= logi TRUE
#  $ : int 10
#   ..- attr(*, "match.length")= int 9
#   ..- attr(*, "index.type")= chr "chars"
#   ..- attr(*, "useBytes")= logi TRUE
#  $ : int 6
#   ..- attr(*, "match.length")= int 3
#   ..- attr(*, "index.type")= chr "chars"
#   ..- attr(*, "useBytes")= logi TRUE

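In this list, the first element ([[1]]) corresponds to the first string, "Ireallylikemexicana", and so on. In that element, 12 means a match starts at character 12, and match.length says it is 8 characters long. The other elements read the same way.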
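
To make the start and match.length values concrete, here is a small sketch that rebuilds the matches by hand with substring. It only looks at the first match in each string, and the names start, len and manual are illustrative; regmatches remains the more convenient route.

```r
r_list <- c('mexicana', 'macdonald', 'KFC')
nospaces <- c("Ireallylikemexicana", "wanttoeatmacdonaldsomuch!", "IhateKFC")
re <- gregexpr(paste(r_list, collapse = "|"), nospaces)

# First match position and its length for every string
start <- sapply(re, function(m) m[1])
len   <- sapply(re, function(m) attr(m, "match.length")[1])

# Characters start .. start + len - 1 are the matched text
manual <- substring(nospaces, start, start + len - 1)
manual
# [1] "mexicana"  "macdonald" "KFC"
```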

gregexpr and regmatches also find and extract multiple matches within a single string.

others <- c("quuxmexicanaoKFCmmmsdkfj", "quux")
str(re <- gregexpr(paste(r_list, collapse = "|"), others))
# List of 2
#  $ : int [1:2] 5 14
#   ..- attr(*, "match.length")= int [1:2] 8 3
#   ..- attr(*, "index.type")= chr "chars"
#   ..- attr(*, "useBytes")= logi TRUE
#  $ : int -1
#   ..- attr(*, "match.length")= int -1
#   ..- attr(*, "index.type")= chr "chars"
#   ..- attr(*, "useBytes")= logi TRUE
str(regmatches(others, re))
# List of 2
#  $ : chr [1:2] "mexicana" "KFC"
#  $ : chr(0)

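Here the second list element (for "quux") is -1, meaning no match was found. As a result, regmatches returns an empty placeholder (character(0)) in the second position of the list. If you want all matches regardless of which string they came from, you can use unlist.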
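
As a sketch of that last step: unlist flattens the list, and the empty placeholder simply disappears. If you still need to know which input each match came from, one option (an addition here, not part of the answer above) is to repeat each input index by its number of matches using lengths. The names m, all_matches and source_id are illustrative.

```r
r_list <- c('mexicana', 'macdonald', 'KFC')
others <- c("quuxmexicanaoKFCmmmsdkfj", "quux")
re <- gregexpr(paste(r_list, collapse = "|"), others)
m <- regmatches(others, re)

# Flatten: character(0) entries contribute nothing
all_matches <- unlist(m)
all_matches
# [1] "mexicana" "KFC"

# Index of the input string each match came from
source_id <- rep(seq_along(others), lengths(m))
data.frame(source = source_id, match = all_matches)
```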
