Question

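I am doing some text analysis on bank customers' mortgage-related comments, and I found a couple of things I do not understand.

1) After cleaning the data without applying stemming and checking the dimensions of the TDM, the number of terms (2173) is smaller than the number of documents (2373). (This is before removing the stop words, with the TDM built on 1-grams.)

2) Also, I wanted to check the two-word frequencies (rowSums(Matrix)) of a TDM tokenized into bigrams. The problem is that, for example, the most repeated result I got is the two words "Proble miss". Since this grouping was already strange, I went to the dataset and used "Control + F" to try to find it, but I could not. Questions: it seems that some code has stemmed these words. How is that possible? (Of the top 25 bigrams, this is the only one that appears to be stemmed.) Shouldn't this only create bigrams from words that actually appear together?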

{file_cleaning <-  replace_number(files$VERBATIM)
file_cleaning <-  replace_abbreviation(file_cleaning)
file_cleaning <-  replace_contraction(file_cleaning)
file_cleaning <- tolower(file_cleaning)
file_cleaning <- removePunctuation(file_cleaning)
file_cleaning[467]
file_cleaned <- stripWhitespace(file_cleaning)

custom_stops <- c("Bank")
file_cleaning_stops <- c(custom_stops, stopwords("en"))
file_cleaned_stopped<- removeWords(file_cleaning,file_cleaning_stops)

file_cleaned_corups<- VCorpus(VectorSource(file_cleaned))
file_cleaned_tdm <-TermDocumentMatrix(file_cleaned_corups)
dim(file_cleaned_tdm) # Number of terms <number of documents
file_cleaned_mx <- as.matrix(file_cleaned_tdm)

file_cleaned_corups<- VCorpus(VectorSource(file_cleaned_stopped))
file_cleaned_tdm <-TermDocumentMatrix(file_cleaned_corups)
file_cleaned_mx <- as.matrix(file_cleaned_tdm)

dim(file_cleaned_mx)
file_cleaned_mx[220:225, 475:478]

coffee_m <- as.matrix(coffee_tdm)

term_frequency <- rowSums(file_cleaned_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:10]


BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_dtm <- TermDocumentMatrix(file_cleaned_corups, control = list(tokenize = BigramTokenizer))
dim(bigram_dtm)

bigram_bi_mx <- as.matrix(bigram_dtm)
term_frequency <- rowSums(bigram_bi_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:15]

freq_bigrams <- findFreqTerms(bigram_dtm, 25)
freq_bigrams}

Sample of the dataset:

> dput(droplevels(head(files,4)))

structure(list(Score = c(10L, 10L, 10L, 7L), Comments = structure(c(4L,

3L, 1L, 2L), .Label = c("They are nice an quick. 3 years with them, and no issue.",

"Staff not very friendly.",

"I have to called them 3 times. They are very slow.",

"Quick and easy. High value."

), class = "factor")), row.names = c(NA, 4L), class = "data.frame")

Answer 1

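Q1: There are situations where you end up with fewer terms than documents.

First, you are using VectorSource, so the number of documents is the number of elements in your text vector. That is not really representative of the number of documents: an element containing only a space still counts as a document. Second, you are removing stop words. If your text contains many of them, a lot of words will disappear. Finally, TermDocumentMatrix by default removes all words shorter than 3 characters, so any short words left after removing the stop words are dropped as well. You can change this by adjusting the wordLengths option when creating a TermDocumentMatrix / DocumentTermMatrix.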

# wordlengths starting at length 1, default is 3
TermDocumentMatrix(corpus, control=list(wordLengths=c(1, Inf)))
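
To see how these effects combine, here is a minimal sketch on made-up comments (the `docs` vector is hypothetical and not taken from the question's data). It assumes the tm package is installed and compares the default settings with `wordLengths = c(1, Inf)`.

```r
library(tm)

# Hypothetical comments: mostly short words, plus one element that is just a space
docs <- c("ok", "it is ok", "no", " ", "very slow")

# Every element of the vector becomes one document, including the blank one
corpus <- VCorpus(VectorSource(docs))

# Default control: terms shorter than 3 characters are dropped
tdm_default <- TermDocumentMatrix(corpus)

# Keep terms of any length
tdm_all <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# dim() returns (number of terms, number of documents)
dim(tdm_default)
dim(tdm_all)

# Inspect which terms survived in each case
Terms(tdm_default)
Terms(tdm_all)
```

With the defaults, the short words never reach the matrix while the blank element still counts as a document, which is how a TDM can end up with fewer rows than columns.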

Q2: Without a sample document, this is a bit of a guess.

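It is likely a combination of the functions replace_number, replace_contraction, replace_abbreviation, removePunctuation and stripWhitespace. Together they can produce a word that you will not find quickly in the raw data. Your best bet is to look for every word starting with "prob". As far as I can see, "proble" is not a correct stem. Also, qdap and tm do not do any stemming unless you specify it.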
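
One way to do that search, sketched in base R on made-up text (the `cleaned` vector is a hypothetical stand-in for the cleaned comments, e.g. file_cleaned_stopped):

```r
# Hypothetical cleaned comments standing in for the real data
cleaned <- c("problem missing payment", "no problems at all", "probably fine")

# Split every comment on whitespace and collect all tokens
tokens <- unlist(strsplit(cleaned, "\\s+"))

# Distinct tokens that start with "prob"
prob_tokens <- sort(unique(grep("^prob", tokens, value = TRUE)))
print(prob_tokens)

# Indices of the comments that contain at least one such token
prob_docs <- which(grepl("\\bprob", cleaned))
print(prob_docs)
```

The same pattern can be applied to the terms of the bigram matrix, for example `grep("^prob", Terms(bigram_dtm), value = TRUE)`, and running it after each cleaning step should show where "proble" first appears.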

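You also have a mistake in your custom_stops. All stop words are lowercase, and you have converted your text to lowercase, so your custom_stops should be lowercase as well: "bank" instead of "Bank".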
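
A small sketch of the difference, assuming the tm package is installed and using a made-up sentence (the `text` value is hypothetical):

```r
library(tm)

# Lowercase the text first, as in the question's cleaning steps
text <- tolower("The Bank was slow")

# Capitalised custom stop word: matching is case-sensitive, so "bank" stays
kept <- removeWords(text, c("Bank", stopwords("en")))

# Lowercase custom stop word: "bank" is removed
removed <- removeWords(text, c("bank", stopwords("en")))

print(kept)
print(removed)
```

removeWords leaves the surrounding spaces in place, so stripWhitespace is best run after it rather than before.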

R - Text analysis - misleading results
