I have a character vector like this:
sent <- c("The quick brown fox jumps over the lazy dog.",
"Over the lazy dog jumped the quick brown fox.",
"The quick brown fox jumps over the lazy dog.")
I use textcnt() to generate bigrams as follows:
library(tau)  # textcnt() comes from the tau package
txt <- textcnt(sent, method = "string", split = " ", n = 2, tolower = FALSE)
format(txt) gives me all the bigrams:
            frq rank bytes Encoding
Over the      1  4.5     8  unknown
The quick     2 11.5     9  unknown
brown fox     2 11.5     9  unknown
brown fox.    1  4.5    10  unknown
dog jumped    1  4.5    10  unknown
dog. Over     1  4.5     9  unknown
fox jumps     2 11.5     9  unknown
fox. The      1  4.5     8  unknown
jumped the    1  4.5    10  unknown
jumps over    2 11.5    10  unknown
lazy dog      1  4.5     8  unknown
lazy dog.     2 11.5     9  unknown
over the      2 11.5     8  unknown
quick brown   3 15.5    11  unknown
the lazy      3 15.5     8  unknown
the quick     1  4.5     9  unknown
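The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be stripped from the resulting n-grams?
2. Is it possible to prevent n-grams that span two sentences, such as dog. Over and fox. The, from being produced?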
Answer 0: (score: 1)
You can avoid specific n-grams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a complete quanteda solution.
require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’
This tokenizes the text into bigrams and removes punctuation at the same time. Because each sentence is a "document", the bigrams never span documents.
(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick" "quick brown" "brown fox" "fox jumps" "jumps over" "over the" "the lazy" "lazy dog"
##
## Component 2 :
## [1] "Over the" "the lazy" "lazy dog" "dog jumped" "jumped the" "the quick" "quick brown" "brown fox"
##
## Component 3 :
## [1] "The quick" "quick brown" "brown fox" "fox jumps" "jumps over" "over the" "the lazy" "lazy dog"
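The idea of treating each sentence as its own document can also be sketched in base R, without either package. This is only a minimal illustration and not how quanteda does it internally: sentence_bigrams is a hypothetical helper name, and the sketch assumes whitespace-separated words and that all punctuation should simply be dropped.

```r
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Hypothetical helper (not part of tau or quanteda):
# returns one character vector of bigrams per input sentence.
sentence_bigrams <- function(x) {
  # Strip punctuation first, so "dog." becomes "dog".
  x <- gsub("[[:punct:]]", "", x)
  # Split each sentence separately; because every sentence is
  # processed on its own, no bigram can span a sentence boundary.
  words <- strsplit(trimws(x), "[[:space:]]+")
  lapply(words, function(w) {
    if (length(w) < 2) return(character(0))
    # Pair each word with its successor.
    paste(head(w, -1), tail(w, -1))
  })
}

bigrams <- sentence_bigrams(sent)
bigrams[[2]]
```

Each list element corresponds to one sentence, which mirrors the per-document components in the output above.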
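To get their frequencies, use dfm() to build a document-feature matrix, which tabulates the bigram tokens. (Note: you can skip the tokenization step and do this directly with dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)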
(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
##       features
## docs  The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
## text1         1           1         1         1          1        1        1        1        0          0
## text2         0           1         1         0          0        0        1        1        1          1
## text3         1           1         1         1          1        1        1        1        0          0
##       features
## docs  jumped the the quick
## text1          0         0
## text2          1         1
## text3          0         0
topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown brown fox the lazy lazy dog The quick fox jumps jumps over over the Over the
##           3         3        3        3         2         2          2        2        1
## dog jumped jumped the the quick
##          1          1         1
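For comparison, a similar document-by-bigram tabulation can be approximated in base R with table(). This is a rough sketch rather than a description of what dfm() does internally: the names bigram_list, doc_id, counts and totals are illustrative, and the text1 to text3 labels are chosen only to mimic the output above.

```r
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Per-sentence bigrams with punctuation removed (case is preserved).
bigram_list <- lapply(
  strsplit(gsub("[[:punct:]]", "", sent), "[[:space:]]+"),
  function(w) paste(head(w, -1), tail(w, -1))
)

# One document label per bigram, so table() can cross-tabulate them.
doc_id <- rep(paste0("text", seq_along(bigram_list)), lengths(bigram_list))

# Rows are documents, columns are distinct bigrams.
counts <- table(docs = doc_id, features = unlist(bigram_list))

# Overall frequency of each bigram, most frequent first.
totals <- sort(colSums(counts), decreasing = TRUE)
totals
```

With the three example sentences this gives 12 distinct bigrams, the same number of features that dfm() reports.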