Question

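I have a large data frame in which I identify patterns in strings and then extract them. I have provided a small subset to illustrate my task. I generate the patterns by creating a TermDocumentMatrix of multi-word terms. I then search for these patterns in the 'punct_prob' data frame using stri_extract from the stringi package and str_replace from the stringr package.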

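My problem is that I need to keep the punctuation in 'punct_prob$description' so that each string retains its literal meaning. For example, I can't have 2.35mm become 235mm. The TermDocumentMatrix procedure I'm using removes punctuation (or at least periods), so my pattern-search functions can no longer match the original strings.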

In short... how do I keep punctuation when generating a TDM? I tried including removePunctuation = FALSE in the TermDocumentMatrix control argument, but without success.

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                    "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                    "TITANIUM LINE POWER P. B F.O. TRIP SPR",
                                    "MEDESY SPECIAL ITEM")))

punct_prob$description = as.character(punct_prob$description)

# a control for the number of words in phrases
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

#set up ngrams and tdm
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = max_ngram, max = max_ngram))}
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))
punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = BigramTokenizer, removePunctuation=FALSE))
inspect(punct_prob_tdm)

Inspecting the result - no punctuation...

                                   Docs
Terms                              1 2 3 4
  angle head 2 1 for 2 35mm bur    1 0 0 0
  contra angle head 2 1 for 2 35mm 1 0 0 0
  line mini p b f o trip spray     0 1 0 0
  line power p b f o trip spr      0 0 1 0
  titanium line mini p b f o trip  0 1 0 0
  titanium line power p b f o trip 0 0 1 0

Thanks in advance for any help :)

Answer 1

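The problem is not TermDocumentMatrix itself but the RWeka-based n-gram tokenizer. RWeka strips punctuation when it tokenizes.

If you use the NLP tokenizer instead, punctuation is preserved. See the code below.

P.S. I removed a space in the third text line, so that "P. B" becomes "P.B", as in line 2.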

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                                "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                                "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                                                "MEDESY SPECIAL ITEM")))
punct_prob$description = as.character(punct_prob$description)

max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

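# Note: in tm >= 0.7, Corpus() may return a SimpleCorpus, which ignores custom
# tokenizers; building the corpus with VCorpus() avoids that.
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))

# n-gram tokenizer built on NLP's ngrams() and words() (NLP is loaded with tm).
# Punctuation inside each word is left untouched.
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}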

punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = NLPBigramTokenizer))
inspect(punct_prob_tdm)

<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 38
Weighting          : term frequency (tf)

                                        Docs
Terms                                    1 2 3 4
  contra angle head 2:1 for 2.35mm bur   1 0 0 0
  titanium line mini p.b f.o. trip spray 0 1 0 0
  titanium line power p.b f.o. trip spr  0 0 1 0
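
To see why splitting on whitespace alone keeps the punctuation, here is a minimal base R sketch of the same idea. The helper name whitespace_ngrams is hypothetical (it is not part of tm, NLP or RWeka), and it only approximates what the NLP-based tokenizer above does:

```r
# Hypothetical helper, not part of tm or NLP: build word n-grams by
# splitting on whitespace only, so punctuation inside tokens is untouched.
whitespace_ngrams <- function(text, n) {
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]            # drop empty strings from leading spaces
  if (length(tokens) < n) return(character(0)) # too short to form an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

description <- tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                         "MEDESY SPECIAL ITEM"))

# Same control as in the answer: the longest description, counted in words.
max_ngram <- max(lengths(strsplit(description, " ")))

terms <- lapply(description, whitespace_ngrams, n = max_ngram)
print(terms)
```

The fourth description has only three words, so it yields no n-gram of the maximum length. That is consistent with document 4 having no entries in the matrix above.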

Answer 2

The quanteda package is smart enough not to treat intra-word punctuation as "punctuation". This makes building the matrix very easy:

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

require(quanteda)
myDfm <- dfm(txt, ngrams = 6:8, concatenator = " ")
t(myDfm)
#                                        docs
# features                                text1 text2 text3 text4
#   contra angle head for 2.35mm bur          1     0     0     0
#   titanium line mini p.b f.o trip           0     1     0     0
#   line mini p.b f.o trip spray              0     1     0     0
#   titanium line mini p.b f.o trip spray     0     1     0     0
#   titanium line power p.b f.o trip          0     0     1     0
#   line power p.b f.o trip spr               0     0     1     0
#   titanium line power p.b f.o trip spr      0     0     1     0

If you do want to keep "punctuation", it is tokenized as a separate token wherever it ends a term:

myDfm2 <- dfm(txt, ngrams = 8, concatenator = " ", removePunct = FALSE)
t(myDfm2)
#                                          docs
# features                                  text1 text2 text3 text4
#   titanium line mini p.b f.o . trip spray     0     1     0     0
#   titanium line power p.b f.o . trip spr      0     0     1     0

Note that the ngrams argument is fully flexible and accepts a vector of n-gram sizes, as in the first example, where ngrams = 6:8 tells it to form 6-, 7- and 8-grams.
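
The calls above use an older quanteda interface. As far as I know, more recent releases (3.x and later) expect tokenization and n-gram construction as separate steps before dfm(). The following is a rough sketch under that assumption, not the original answer's code; the variable names toks and toks_ngram are illustrative, and the exact features produced may differ between versions:

```r
library(quanteda)

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

# Step 1: tokenize; remove_punct = TRUE drops stand-alone punctuation tokens
# but leaves punctuation that sits inside a token.
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Step 2: form 6-, 7- and 8-grams joined by a space, mirroring ngrams = 6:8.
toks_ngram <- tokens_ngrams(toks, n = 6:8, concatenator = " ")

# Step 3: build the document-feature matrix and list its features.
my_dfm <- dfm(toks_ngram)
featnames(my_dfm)
```

Setting remove_punct = FALSE in the first step would correspond to the second example, where a trailing period becomes a token of its own.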
