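I am working with unstructured data (text). I want to tag the data with certain keywords and with combinations of keywords.
I am unable to tag the data with word combinations. I want to know where "fraud" and "mis-selling" occur together.
I tried the qdap package. I can tag the two words with an OR condition, but not with an AND condition.
Below is the code I am using: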
library(qdap)
df <- read.csv(file.choose(), header = TRUE)
#### cleaning of text
df$Comment <- strip(df$Comment)  ## remove capitalization and punctuation
df$Comment <- clean(df$Comment)
df$Comment <- scrubber(df$Comment)
df$Comment <- qprep(df$Comment)
df$Comment <- replace_abbreviation(df$Comment)
terms <- list(
  " fraud ",
  " refund ", " cheat ", " cancellation ", "missold", "delay",
  combo1 = qcv(fraud, missold))
df2 <- with(df, termco(df$Comment, df$Comment, terms))[["raw"]]  ### tagging of data with key words
df3 <- merge(df, df2, by = "Comment")
I am using complaint data for an insurance company. My variables are
Answer 0 (score: 0)
Based on your sample xlsx:
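library(xlsx)
df <- read.xlsx(file = "sample output.xlsx", sheetIndex = 1)
library(tm)
# stem the keywords so they match the stemmed document terms (e.g. "misselling" becomes "missel")
terms <- stemDocument(c("fraud", "refund", "cheat", "cancellation", "misselling", "delay"))
# binary document-term matrix restricted to the keywords: 1 if the term occurs in a comment, else 0
mat <- DocumentTermMatrix(x = Corpus(VectorSource(df$Comment)),
                          control = list(removePunctuation = TRUE,
                                         dictionary = terms,
                                         stemming = TRUE,
                                         weighting = weightBin))
df2 <- as.data.frame(as.matrix(mat))
# with 0/1 columns, combo == 2 means both "fraud" and "missel" occur (the AND condition)
(df2 <- transform(df2, combo = fraud + missel))
df2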
#   cancel cheat delay fraud missel refund combo
# 1      1     0     0     1      1      0     2
# 2      1     0     0     1      1      0     2
# 3      0     0     0     1      1      0     2
df3 <- cbind(df, df2)
df3
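If you would rather not depend on tm, the same AND tagging can be sketched in base R with grepl. This is only a minimal illustration and not the qdap or tm approach: the sample comments, the `keywords` vector, and the column name `fraud_and_missold` are made up for the example, and it matches whole words without any stemming.

```r
# Hypothetical sample data; the real comments would come from the complaints file.
df <- data.frame(
  Comment = c(
    "This is fraud, the policy was missold to me",
    "I want a refund after the delay",
    "Clear case of fraud"
  ),
  stringsAsFactors = FALSE
)

# Keywords to tag (illustrative list, no stemming applied).
keywords <- c("fraud", "refund", "missold", "delay")

# One 0/1 indicator column per keyword.
# \\b anchors the match to whole words; matching is case-insensitive.
tags <- as.data.frame(lapply(
  setNames(keywords, keywords),
  function(k) {
    as.integer(grepl(paste0("\\b", k, "\\b"), df$Comment,
                     ignore.case = TRUE, perl = TRUE))
  }
))

# AND condition: 1 only when both keywords occur in the same comment.
tags$fraud_and_missold <- as.integer(tags$fraud == 1 & tags$missold == 1)

# Attach the tags to the original data.
tagged <- cbind(df, tags)
print(tagged)
```

Because the AND column is built from the indicator columns, an OR column can be added the same way by swapping `&` for `|`.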