I have a long data frame of words. I want to use several specific words to find every part-of-speech form of each of them.
For example:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
word
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I want to extract the words "clean" and "supply" in their different parts of speech (POS). I tried to do this with the grep() function.
specific_word <- c("clean", "supply")
grep_onto <- df_1[grepl(paste(ontoword_apparatus, collapse = "|"), df_1$word), ] %>%
data.frame(word = ., row.names = NULL) %>%
unique()
But the result is not what I want:
word
1 cleans
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I would prefer:
word
1 clean
2 cleaning
3 supplying
4 supply
I know a regular expression could probably solve my problem, but I don't know how to define it. Can anyone give me some advice?
Answer 0 (score: 2)
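There are several ways to do this, but in general, if you want the match to be a single word and you are using regular expressions, you need to specify the start (^) and end ($) of the line to limit what can occur before or after your pattern. You seem to want the match to be able to extend with more letters, so add \\w* to allow that: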
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')
pattern
#> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"
df[grep(pattern, df$word), , drop = FALSE] # drop = FALSE to stop simplification to vector
#> word
#> 1 clean
#> 3 cleaning
#> 5 supplying
#> 6 supply
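To make the effect of the anchors easier to see, here is a small sketch that compares the unanchored alternation with the anchored pattern from above. The `candidates` vector, including the entry "unclean", is made up for illustration and is not part of the original data.

```r
specific_word <- c("clean", "supply")

# Hypothetical test strings, chosen only to illustrate the difference
candidates <- c("clean", "cleaning", "cleaning composition",
                "supplying cmp abrasive", "unclean")

# Plain alternation: matches anywhere inside the string
unanchored <- paste(specific_word, collapse = "|")

# Anchored version: the whole string must consist of word characters
anchored <- paste0("^\\w*", specific_word, "\\w*$", collapse = "|")

unanchored_hits <- grepl(unanchored, candidates)
anchored_hits <- grepl(anchored, candidates)

# Side-by-side view of which strings each pattern accepts
data.frame(candidates, unanchored_hits, anchored_hits)
```

Because \\w does not match a space, the anchored pattern rejects the multi-word entries. Note that the leading \\w* also admits prefixed words such as "unclean"; if that is not wanted, dropping the leading \\w* (that is, using '^clean\\w*$') is one option.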
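Another interpretation of what you are looking for is to split each entry into individual words and search each of them for a match. tidyr::separate_rows can be used for that kind of splitting, and then you can filter with grepl:
library(tidyverse)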
df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
df %>% separate_rows(word) %>%
filter(grepl(paste(specific_word, collapse = '|'), word)) %>%
distinct()
#> # A tibble: 4 x 1
#> word
#> <chr>
#> 1 clean
#> 2 cleaning
#> 3 supplying
#> 4 supply
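If loading the tidyverse is not desirable, the same split-then-filter idea can be sketched in base R. This is only one possible equivalent, assuming whitespace is the only separator between words; the variable names `tokens` and `hits` are introduced here for illustration.

```r
words <- c("clean", "grinding liquid cmp", "cleaning",
           "cleaning composition", "supplying", "supply",
           "supplying cmp abrasive", "chemical mechanical")
specific_word <- c("clean", "supply")

# Split every entry on runs of whitespace and flatten to one vector
tokens <- unlist(strsplit(words, "\\s+"))

# Keep the tokens that contain any of the target words, without duplicates
hits <- unique(grep(paste(specific_word, collapse = "|"), tokens, value = TRUE))
hits
```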
For more robust word tokenization, try tidytext::unnest_tokens or another dedicated word tokenizer.
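As a rough sketch of the tokenizer route, the following uses tidytext::unnest_tokens in place of separate_rows. It assumes the dplyr and tidytext packages are installed; the output column name `token` and the variable `result` are chosen here for illustration. By default unnest_tokens splits into words, lowercases them, and strips punctuation.

```r
library(dplyr)
library(tidytext)

df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
                      "cleaning composition", "supplying", "supply",
                      "supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")

result <- df %>%
  # One row per word token; `token` is the new column, `word` the input column
  unnest_tokens(output = token, input = word) %>%
  # Keep tokens containing any of the target words
  filter(grepl(paste(specific_word, collapse = "|"), token)) %>%
  distinct()

result
```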