
Creating ngram tokens that ignore numbers without removing them from the ngram

Time: 2019-04-14 09:06:11

Tags: r nlp n-gram

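I am creating ngram tokens from a vector of sentences. Some of these sentences contain numbers. I want to find the trigrams of each sentence such that numbers are ignored when counting the words of an ngram, but are not removed from it.

For example, given the string "this is an example 2019 string", the trigrams I would like to get back are:

"this is an", "is an example", "an example 2019 string".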
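
To make the intended rule concrete, here is a minimal base R sketch of it, independent of quanteda. The helper name `ngrams_skipping_numbers` is hypothetical, and the sketch makes two assumptions that the question does not settle: a "number" is a token made up only of digits, and a number is kept only when it falls between the first and last word of a window (leading or trailing numbers are dropped from that window).

```r
# Hypothetical helper (not part of quanteda): build n-grams over the
# non-numeric tokens of a sentence, keeping any numeric tokens that sit
# between the first and last word of each window.
ngrams_skipping_numbers <- function(sentence, n = 3) {
  # Split on whitespace, similar in spirit to what = "fasterword"
  toks <- strsplit(trimws(sentence), "\\s+")[[1]]

  # Assumption: a "number" is a token consisting only of digits
  word_pos <- which(!grepl("^[0-9]+$", toks))

  # Not enough words to form a single n-gram
  if (length(word_pos) < n) return(character(0))

  starts <- seq_len(length(word_pos) - n + 1)
  vapply(starts, function(i) {
    from <- word_pos[i]          # position of the first word in the window
    to   <- word_pos[i + n - 1]  # position of the n-th word in the window
    paste(toks[from:to], collapse = " ")
  }, character(1))
}

ngrams_skipping_numbers("this is an example 2019 string")
# [1] "this is an"             "is an example"          "an example 2019 string"
```

For a vector of sentences, something like `lapply(sentences, ngrams_skipping_numbers)` would apply the same rule to each element.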

library(tidyverse)
library(quanteda)

test_sentence <- "this is an example 2019 string" 

ngrams <- test_sentence %>% tokens(., ngrams = 3, what = "fasterword", remove_numbers = FALSE, concatenator = " ")

tokens from 1 document.
text1 :
[1] "this is an"          "is an example"       "an example 2019"     "example 2019 string"

Does anyone know how to ignore the numbers when forming the trigrams?

Thanks

0 answers:

No answers yet