Update: thanks for the input so far. I have rewritten the question and added a better example that highlights an implicit requirement my first example did not cover.
Problem
I am looking for a general tidy solution to remove ngrams that contain stop words. In short, an ngram is a string of words separated by spaces. A unigram contains 1 word, a bigram 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bi & tri and up.
New example data
ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of", "custom",
  "is", "custom"
)
desired_output <- tibble::tribble(
  ~Document, ~ngram,
  1, "basis",
  1, "culture",
  1, "ground water",
  1, "ground water treatment"
)
Created on 2019-03-21 by the reprex package (v0.2.1)
Desired behavior
Stop words from the word column of stopword_df should be removed from the ngram column of ngram_df, turning it into desired_output. Matching must be on whole words only (is must not remove basis).
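One way to satisfy the whole-word requirement (a sketch of a possible approach, not part of the original question) is to collapse the stop word list into a regex wrapped in \b word boundaries, so that "is" cannot match inside "basis":

```r
library(dplyr)
library(stringr)

ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of", "custom",
  "is", "custom"
)

# "\\b(the|of|is)\\b" -- matches stop words only as whole words
stop_re <- paste0("\\b(", paste(stopword_df$word, collapse = "|"), ")\\b")
result  <- ngram_df %>% filter(!str_detect(ngram, stop_re))
# keeps: basis, culture, ground water, ground water treatment
```

Because the pattern is built from the stop word column itself, this works for ngrams of any length without knowing n in advance.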
Example data
library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>%
enframe() %>%
unnest_tokens(ngrams, value, "ngrams", n = 2)
#apply magic here
df
#> # A tibble: 21 x 2
#> name ngrams
#> <int> <chr>
#> 1 1 groundwater remediation
#> 2 1 remediation is
#> 3 1 is the
#> 4 1 the process
#> 5 1 process that
#> 6 1 that is
#> 7 1 is used
#> 8 1 used to
#> 9 1 to treat
#> 10 1 treat polluted
#> # ... with 11 more rows
Example stop words
stopwords <- c("is", "the", "that", "to")
Desired output
#> Source: local data frame [9 x 2]
#> Groups: <by row>
#>
#> # A tibble: 9 x 2
#> name ngrams
#> <int> <chr>
#> 1 1 groundwater remediation
#> 2 1 treat polluted
#> 3 1 polluted groundwater
#> 4 1 groundwater by
#> 5 1 by removing
#> 6 1 pollutants or
#> 7 1 or converting
#> 8 1 them into
#> 9 1 harmless products
Created on 2019-03-20 by the reprex package (v0.2.1)
(Example sentence from https://en.wikipedia.org/wiki/Groundwater_remediation)
Answer (score: 0)
Here is another approach, using the stopwords_collapsed pattern from the previous answer:
swc <- paste(stopwords, collapse = "|")
df <- df[!str_detect(df$ngrams, swc), ] # keep rows without stopwords (note: also matches substrings, e.g. "to" hits "into")
df
# A tibble: 8 x 2
name ngrams
<int> <chr>
1 1 groundwater remediation
2 1 treat polluted
3 1 polluted groundwater
4 1 groundwater by
5 1 by removing
6 1 pollutants or
7 1 or converting
8 1 harmless products
Here is a quick benchmark comparing the two approaches:
# benchmark (requires the rbenchmark package; `txt` and
# `stopwords_collapsed` are defined in the previous answer)
library(rbenchmark)

txtexp <- rep(txt, 1000000)
dfexp <- txtexp %>%
  enframe() %>%
  unnest_tokens(ngrams, value, "ngrams", n = 2)

benchmark(
  "mutate+filter (small text)" = {
    df1 <- df %>%
      mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
      filter(!has_stop_word)
  },
  "[] row selection (small text)" = {
    df2 <- df[!str_detect(df$ngrams, stopwords_collapsed), ]
  },
  "mutate+filter (large text)" = {
    df3 <- dfexp %>%
      mutate(has_stop_word = str_detect(ngrams, stopwords_collapsed)) %>%
      filter(!has_stop_word)
  },
  "[] row selection (large text)" = {
    df4 <- dfexp[!str_detect(dfexp$ngrams, stopwords_collapsed), ]
  },
  replications = 5,
  columns = c("test", "replications", "elapsed")
)
test replications elapsed
4 [] row selection (large text) 5 30.03
2 [] row selection (small text) 5 0.00
3 mutate+filter (large text) 5 30.64
1 mutate+filter (small text) 5 0.00