Question

I'm writing an R script to find bigrams.

I have a vector of 4157 words.

Now, using stylo, I'm building bigrams from my vector as follows.

library(stylo)

allBi <- txt.to.words(myLines)
myBigrams <- make.ngrams(allBi, ngram.size = 2)

This returns only 4120 bigrams. What's wrong?

Answer 1

The problem is that you didn't run any tests to try to figure out what's going on.

According to the tests below, one (or more) of the 4,157 entries in myLines apparently doesn't contain a "word" as the stylo package sees words:

library(stylo)

This file has 235,886 legitimate words on my OS X system:

words <- readLines("/usr/share/dict/words")

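Now, run a test to see whether anything related to vector length affects make.ngrams or (more likely) txt.to.words. Note: I didn't want to wait the couple of minutes it takes make.ngrams to work through sequences up to 235,886, so the maximum is 20,000, well above your 4,120: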

all(sapply(seq(from=2, to=20000, by=100), function(i) {
  return(i - length(make.ngrams(txt.to.words(words[1:i]), ngram.size=2))==1)
}))
# [1] TRUE
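
The check above relies on the fact that n tokens yield n - 1 bigrams when every entry survives tokenization. A minimal base-R sketch of that relationship, which deliberately does not use stylo (the `tokens` vector and the `paste`-based pairing are illustrative assumptions, not a description of what make.ngrams does internally):

```r
# Hypothetical token vector, for illustration only
tokens <- c("the", "quick", "brown", "fox")

# Pair each token with its successor: n tokens give n - 1 bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

print(bigrams)
# [1] "the quick"   "quick brown" "brown fox"

# The difference the test above checks for
print(length(tokens) - length(bigrams))
# [1] 1
```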

So it's not a vector-size problem. Could it be that some entries in the vector contain no actual words? Let's test:

# inject some badness
words[4] <- sprintf("  , %s - ", words[4])
words[30] <- "//"
words[900] <- "-1--1-"
words[4000]  <- ".."

Try again:

all(sapply(seq(from=2, to=20000, by=100), function(i) {
  return(i - length(make.ngrams(txt.to.words(words[1:i]), ngram.size=2))==1)
}))
# [1] FALSE

Let's look at what txt.to.words does to the entries with real "badness":

txt.to.words(words[c(4, 30, 900, 4000)])
# [1] "aal"

Use this to find the entries in words that contain no letters:

which(grepl("^[^[:alpha:]]+$", words))
# [1]   30  900 4000
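
Applied to the question: 4157 entries would give 4156 bigrams if every entry were exactly one word, so a result of 4120 suggests that around 36 tokens went missing (assuming no entry splits into several words). A rough base-R sketch of locating and dropping letterless entries before tokenizing; the sample `myLines` here is a made-up stand-in, and the names `has_letter`, `bad_idx`, and `clean` are hypothetical:

```r
# Made-up stand-in for the asker's vector; real data will differ
myLines <- c("alpha", "//", "beta", "-1--1-", "gamma", "..")

# TRUE for entries with at least one alphabetic character
has_letter <- grepl("[[:alpha:]]", myLines)

# Positions of entries that contain no letters at all
bad_idx <- which(!has_letter)
print(bad_idx)
# [1] 2 4 6

# Keep only entries that can yield a word
clean <- myLines[has_letter]

# Expected bigram count if every kept entry yields exactly one word
print(length(clean) - 1)
# [1] 2
```

The cleaned vector can then go through txt.to.words and make.ngrams as in the question. Entries that contain hyphens or apostrophes may still split into more than one token, depending on how txt.to.words is configured, so the count is worth re-checking afterwards.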

Testing FTW (it's not much work to actually run a few tests when things don't go as expected).