Question

I want to count the number of English words in a string of text.

df.words <- data.frame(ID = 1:2,
              text = c(c("frog friend fresh frink foot"),
                       c("get give gint gobble")))

df.words

  ID                         text
1  1 frog friend fresh frink foot
2  2         get give gint gobble

I would like the final product to look like this:

  ID                         text count
1  1 frog friend fresh frink foot     4
2  2         get give gint gobble     3

I'm guessing I will have to split on spaces first and then check each word against a dictionary?

Answer 1

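Building on @r2evans's suggestion to use strsplit(), and using a random English-words .txt dictionary file found online, here is an example. This solution may not scale well if you are doing a large number of comparisons, because of the unnest step.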

library(dplyr)
library(tidyr)

# text file with 479k English words ~4MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"), col.names = "text2")

df.words <- data.frame(ID = 1:2,
                       text = c(c("frog friend fresh frink foot"),
                                c("get give gint gobble")),
                       stringsAsFactors = FALSE)

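df.words %>% 
  # split each string into a list-column of words
  mutate(text2 = strsplit(text, split = "\\s")) %>% 
  # expand to one row per word
  unnest(text2) %>% 
  # keep only the words that appear in the dictionary
  semi_join(dict, by = c("text2")) %>% 
  group_by(ID, text) %>% 
  # count the surviving words for each original row
  summarise(count = length(text2))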

Output

     ID text                         count
  <int> <chr>                        <int>
1     1 frog friend fresh frink foot     4
2     2 get give gint gobble             3
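
One edge case worth checking: because semi_join() removes every non-matching word before the grouping step, a row with no dictionary words at all drops out of the result instead of showing a count of 0. The sketch below is one possible variation, not the original answer's code. It skips the semi_join and counts matches inside summarise() instead. `dict_small` and `df.zero` are hypothetical stand-ins invented for this example so that it runs without downloading the full word list.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature dictionary and data (not from the original post)
dict_small <- data.frame(text2 = c("frog", "friend", "fresh", "foot"),
                         stringsAsFactors = FALSE)
df.zero <- data.frame(ID = 1:2,
                      text = c("frog friend fresh frink foot",
                               "gint frink"),   # no dictionary words here
                      stringsAsFactors = FALSE)

res <- df.zero %>%
  mutate(text2 = strsplit(text, split = "\\s")) %>%
  unnest(text2) %>%
  group_by(ID, text) %>%
  # count matches per group instead of filtering first, so zero-match rows survive
  summarise(count = sum(text2 %in% dict_small$text2), .groups = "drop")

res
```

With this variation the second row is kept with a count of 0 rather than disappearing.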

Answer 2

A base R alternative, using EJJ's great recommendation for dict:

sapply(strsplit(df.words$text, "\\s+"),
       function(z) sum(z %in% dict$text2))
# [1] 4 3
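
To get the `count` column the question asks for, the result can be assigned straight back onto the data frame. A minimal sketch, assuming a tiny stand-in dictionary (`dict_small` is hypothetical, used only so the example runs without the download):

```r
# Hypothetical stand-in for the 479k-word dictionary
dict_small <- data.frame(text2 = c("frog", "friend", "fresh", "foot",
                                   "get", "give", "gobble"),
                         stringsAsFactors = FALSE)

df.words <- data.frame(ID = 1:2,
                       text = c("frog friend fresh frink foot",
                                "get give gint gobble"),
                       stringsAsFactors = FALSE)

# Split each string on whitespace, then count the pieces found in the dictionary
df.words$count <- sapply(strsplit(df.words$text, "\\s+"),
                         function(z) sum(z %in% dict_small$text2))
df.words
```

With the real `dict` from the first answer, `dict_small$text2` would simply be replaced by `dict$text2`.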

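I would have thought this would be a clear winner on speed, but apparently doing sum(. %in% .) one element at a time can be a bit expensive. (It is slower with this data.)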

Faster, but not necessarily simpler:

words <- strsplit(df.words$text, "\\s+")
words <- sapply(words, `length<-`, max(lengths(words)))
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
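
The padding trick may be easier to follow step by step. The sketch below walks through the same idea with a hypothetical miniature dictionary (`dict_small` and `texts` are made-up names for this illustration): every word vector is extended to the length of the longest one, which lets sapply() return a matrix, so the dictionary lookup runs once over all words instead of once per row.

```r
# Hypothetical miniature dictionary and input strings
dict_small <- c("frog", "friend", "fresh", "foot", "get", "give", "gobble")
texts <- c("frog friend fresh frink foot", "get give gint gobble")

words <- strsplit(texts, "\\s+")

# `length<-` extends each vector to the longest length, padding with NA,
# so sapply() can bind the results into a character matrix
padded <- sapply(words, `length<-`, max(lengths(words)))
dim(padded)   # rows are word positions, columns are the input strings

# %in% drops the matrix shape, so array() restores it
found <- array(padded %in% dict_small, dim = dim(padded))

# one count per column, i.e. per input string
colSums(found)
```

The NA padding counts as a miss as long as the dictionary itself contains no NA entries.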

It is a little faster than EJJ's solution (around 10-15%), so it may only be worth it if you need to squeeze some extra performance out of this.

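(Caveat: EJJ's solution is faster with this 2-row dataset. If the data is 1000 times larger, then my first solution is a little faster and my second solution is twice as fast. Benchmarks are benchmarks, though: don't optimize code beyond usability if speed/time is not a critical factor.)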
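
For anyone who wants to repeat the comparison on their own data, here is a rough timing sketch in base R. Everything in it is an assumption made for illustration: the synthetic dictionary, the random strings, and the helper names `count_loop` and `count_matrix` are not from the original posts, and timings will vary by machine and data size.

```r
set.seed(1)

# Hypothetical synthetic data: a made-up dictionary of 4-letter strings
# and 2000 random 5-word texts ("zzzzz" is never in the dictionary)
dict_small <- unique(replicate(5000, paste(sample(letters, 4), collapse = "")))
texts <- replicate(2000,
                   paste(sample(c(dict_small[1:50], "zzzzz"), 5), collapse = " "))

# Row-by-row lookup (first base R approach)
count_loop <- function(x, dict) {
  sapply(strsplit(x, "\\s+"), function(z) sum(z %in% dict))
}

# Single lookup over a padded matrix (second base R approach)
count_matrix <- function(x, dict) {
  w <- strsplit(x, "\\s+")
  w <- sapply(w, `length<-`, max(lengths(w)))
  colSums(array(w %in% dict, dim = dim(w)))
}

# Both approaches should agree before their timings are compared
stopifnot(all(count_loop(texts, dict_small) == count_matrix(texts, dict_small)))

system.time(for (i in 1:20) count_loop(texts, dict_small))
system.time(for (i in 1:20) count_matrix(texts, dict_small))
```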

Counting the number of English words in a string in R
