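R: How do I use grep() to find specific words?

Time: 2017-08-15 03:03:10

Tags: r regex

I have a long data frame of words. I want to use several specific words to find all of their part-of-speech variants.

For example: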

df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning", 
                          "cleaning composition", "supplying", "supply", 
                          "supplying cmp abrasive", "chemical mechanical"))

  word
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical

I want to extract the words "clean" and "supply" in their different parts of speech (POS). I tried to do this with the grep() function.

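library(magrittr)    # provides the %>% pipe

specific_word <- c("clean", "supply")

# keep every row that contains any of the words anywhere in the string
grep_onto <- df[grepl(paste(specific_word, collapse = "|"), df$word), ] %>%
    data.frame(word = ., row.names = NULL) %>%
    unique()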

But the result is not what I want:

  word
1 cleans
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical

I would prefer:

  word
1 clean
2 cleaning
3 supplying
4 supply

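I know a regular expression could probably solve my problem, but I don't know how to define it. Can anyone give me some advice?

1 Answer:

Answer 0 (score: 2)

There are multiple ways to do this, but in general, if you want the match to be just a single word and you are using regex, you need to specify the beginning ^ and the end $ of the line to limit what can come before or after your pattern. It looks like you want the word to be able to expand with more letters, so add \\w* to allow that: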

df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning", 
                          "cleaning composition", "supplying", "supply", 
                          "supplying cmp abrasive", "chemical mechanical"))

specific_word <- c("clean", "supply")
pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')

pattern
#> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"

df[grep(pattern, df$word), , drop = FALSE]    # drop = FALSE to stop simplification to vector
#>        word
#> 1     clean
#> 3  cleaning
#> 5 supplying
#> 6    supply

Another interpretation of what you are looking for is to split each entry into individual words and search those for a match. tidyr::separate_rows can do the splitting, after which you can filter with grepl:

library(tidyverse)

df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning", 
                      "cleaning composition", "supplying", "supply", 
                      "supplying cmp abrasive", "chemical mechanical"))

specific_word <- c("clean", "supply")

df %>% 
    separate_rows(word) %>%    # one row per individual word
    filter(grepl(paste(specific_word, collapse = '|'), word)) %>% 
    distinct()
#> # A tibble: 4 x 1
#>        word
#>       <chr>
#> 1     clean
#> 2  cleaning
#> 3 supplying
#> 4    supply
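
If you would rather not load the tidyverse, the same split-then-filter idea can be sketched in base R. This is only an illustration under the assumption that words are separated by whitespace; tokens and hits are names invented for the example.

```r
# Example data from the question; stringsAsFactors = FALSE keeps plain strings.
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
                          "cleaning composition", "supplying", "supply",
                          "supplying cmp abrasive", "chemical mechanical"),
                 stringsAsFactors = FALSE)
specific_word <- c("clean", "supply")

# Split every entry on runs of whitespace and flatten into one vector of words.
tokens <- unlist(strsplit(df$word, "\\s+"))

# Keep the words that contain any of the stems, then drop duplicates.
hits <- unique(tokens[grepl(paste(specific_word, collapse = "|"), tokens)])
print(hits)
```

Unlike the anchored pattern, this also keeps "cleaning" from "cleaning composition", it just reports each distinct word once.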

For more robust word tokenization, try tidytext::unnest_tokens or another dedicated word tokenizer.
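
As a rough sketch of that route, assuming the dplyr and tidytext packages are installed: unnest_tokens splits the word column into one token per row (lowercasing by default), and the rest is the same filter as before. The output column name token and the variable result are chosen for this example only.

```r
library(dplyr)
library(tidytext)

df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
                      "cleaning composition", "supplying", "supply",
                      "supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")

result <- df %>%
    unnest_tokens(token, word) %>%    # one row per token, taken from the word column
    filter(grepl(paste(specific_word, collapse = "|"), token)) %>%
    distinct()

print(result)
```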