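Below is my code for creating bigrams from text data. The output is fine, except that I need the field names to contain an underscore so that I can use them as variables in a model.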
text <- c("Since I love to travel, this is what I rely on every time.",
          "I got the rewards card for the no international transaction fee",
          "I got the rewards card mainly for the flight perks",
          "Very good card, easy application process, and no international transaction fee",
          "The customer service is outstanding!",
          "My wife got the rewards card for the gift cards and international transaction fee.She loves it")
df <- data.frame(text)

library(tm)

# Build the corpus and normalise the text
corpus <- Corpus(DataframeSource(df))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Tokenizer that joins each pair of adjacent words with a space
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
sparse <- removeSparseTerms(dtm, 0.80)
dtm2 <- as.matrix(sparse)
dtm2
Here is the output:
    Terms
Docs got rewards international transaction rewards card transaction fee
1              0                         0            0               0
2              1                         1            1               1
3              1                         0            1               0
4              0                         1            0               1
5              0                         0            0               0
6              1                         1            1               0
How can I make the field names look like got_rewards instead of got rewards?
Answer 0 (score: 1)
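I guess this is not really a tm-specific question. In any case, you can either set collapse="_" in your tokenizer code, or modify the column names after the fact, like this: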
colnames(dtm2) <- gsub(" ", "_", colnames(dtm2), fixed = TRUE)
dtm2
    Terms
Docs got_rewards international_transaction rewards_card transaction_fee
1              0                         0            0               0
2              1                         1            1               1
3              1                         0            1               0
4              0                         1            0               1
5              0                         0            0               0
6              1                         1            1               0
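
To illustrate the first option, here is a minimal base-R sketch of a bigram tokenizer that joins each pair with an underscore. The helper name bigram_underscore is hypothetical, and splitting on whitespace is an assumption standing in for words(x). In the question's BigramTokenizer, the equivalent change would be passing collapse = "_" to paste.

```r
# Hypothetical helper (not part of tm): build underscore-joined bigrams
# from a single string, using base R only.
bigram_underscore <- function(x) {
  # Assumption: words are separated by whitespace
  w <- unlist(strsplit(x, "[[:space:]]+"))
  w <- w[nzchar(w)]                      # drop empty tokens from leading spaces
  if (length(w) < 2) return(character(0))
  # Pair each word with its successor and join the pair with "_"
  paste(head(w, -1), tail(w, -1), sep = "_")
}

bigram_underscore("got rewards card")
```

Because the terms are created with the underscore already in place, no renaming step is needed afterwards. The gsub approach above is still handy when the matrix already exists.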