Update: thanks for the input so far. I have rewritten the question and added a better example that highlights an implicit requirement my first example did not cover.
Problem
I am looking for a general tidy solution to remove ngrams that contain stop words. In short, an ngram is a string of words separated by spaces. A unigram contains 1 word, a bigram 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bi & tri and up.
New example data
ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of", "custom",
  "is", "custom"
)
desired_output <- tibble::tribble(
  ~Document, ~ngram,
  1, "basis",
  1, "culture",
  1, "ground water",
  1, "ground water treatment"
)
Created on 2019-03-21 by the reprex package (v0.2.1)
Desired behavior
Stop words from the word column of stopword_df should be removed from the ngram column of ngram_df, turning it into desired_output. Matching must be on whole words only (is must not remove basis).
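One way to satisfy the whole-word requirement (a sketch of a possible approach, not part of the original question) is to collapse the stop word list into a regex wrapped in \b word boundaries, so that "is" cannot match inside "basis":

```r
library(dplyr)
library(stringr)

ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of", "custom",
  "is", "custom"
)

# "\\b(the|of|is)\\b" -- matches stop words only as whole words
stop_re <- paste0("\\b(", paste(stopword_df$word, collapse = "|"), ")\\b")
result  <- ngram_df %>% filter(!str_detect(ngram, stop_re))
# keeps: basis, culture, ground water, ground water treatment
```

Because the pattern is built from the stop word column itself, this works for ngrams of any length without knowing n in advance.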
Example data
library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>%
enframe() %>%
unnest_tokens(ngrams, value, "ngrams", n = 2)
#apply magic here
df
#> # A tibble: 21 x 2
#> name ngrams
#> <int> <chr>
#> 1 1 groundwater remediation
#> 2 1 remediation is
#> 3 1 is the
#> 4 1 the process
#> 5 1 process that
#> 6 1 that is
#> 7 1 is used
#> 8 1 used to
#> 9 1 to treat
#> 10 1 treat polluted
#> # ... with 11 more rows
Example stop words
stopwords <- c("is", "the", "that", "to")
Desired output
#> Source: local data frame [9 x 2]
#> Groups: <by row>
#>
#> # A tibble: 9 x 2
#> name ngrams
#> <int> <chr>
#> 1 1 groundwater remediation
#> 2 1 treat polluted
#> 3 1 polluted groundwater
#> 4 1 groundwater by
#> 5 1 by removing
#> 6 1 pollutants or
#> 7 1 or converting
#> 8 1 them into
#> 9 1 harmless products
Created on 2019-03-20 by the reprex package (v0.2.1)
(Example sentence from https://en.wikipedia.org/wiki/Groundwater_remediation)
Answer (score: 0)
Here is another approach, using the stopwords_collapsed pattern from the previous answer:
swc <- paste(stopwords, collapse = "|")
df <- df[!str_detect(df$ngrams, swc), ] # keep rows without stopwords (note: also matches substrings, e.g. "to" hits "into")
df
# A tibble: 8 x 2
name ngrams
<int> <chr>
1 1 groundwater remediation
2 1 treat polluted
3 1 polluted groundwater
4 1 groundwater by
5 1 by removing
6 1 pollutants or
7 1 or converting
8 1 harmless products
Here is a quick benchmark comparing the two approaches:
# benchmark (requires the rbenchmark package; `txt` and
# `stopwords_collapsed` are defined in the previous answer)
library(rbenchmark)

txtexp <- rep(txt, 1000000)
dfexp <- txtexp %>%
  enframe() %>%
  unnest_tokens(ngrams, value, "ngrams", n = 2)

benchmark(
  "mutate+filter (small text)" = {
    df1 <- df %>%
      mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
      filter(!has_stop_word)
  },
  "[] row selection (small text)" = {
    df2 <- df[!str_detect(df$ngrams, stopwords_collapsed), ]
  },
  "mutate+filter (large text)" = {
    df3 <- dfexp %>%
      mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
      filter(!has_stop_word)
  },
  "[] row selection (large text)" = {
    df4 <- dfexp[!str_detect(dfexp$ngrams, stopwords_collapsed), ]
  },
  replications = 5,
  columns = c("test", "replications", "elapsed")
)
test replications elapsed
4 [] row selection (large text) 5 30.03
2 [] row selection (small text) 5 0.00
3 mutate+filter (large text) 5 30.64
1 mutate+filter (small text) 5 0.00