Question

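Below is my code for creating bigrams from text data. The output is fine, except that I need the field names to contain an underscore so that I can use them as variables in a model.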

text<- c("Since I love to travel, this is what I rely on every time.", 
        "I got the rewards card for the no international transaction fee", 
        "I got the rewards card mainly for the flight perks",
        "Very good card, easy application process, and no international transaction fee",
        "The customer service is outstanding!",
        "My wife got the rewards card for the gift cards and international transaction fee.She loves it") 
df<- data.frame(text) 


library(tm)
corpus<- Corpus(DataframeSource(df))
corpus<- tm_map(corpus, content_transformer(tolower))
corpus<- tm_map(corpus, removePunctuation)
corpus<- tm_map(corpus, removeWords, stopwords("english"))
corpus<- tm_map(corpus, stripWhitespace)


BigramTokenizer<-
  function(x)
    unlist(lapply(ngrams(words(x),2),paste,collapse=" "),use.names=FALSE)

dtm<- DocumentTermMatrix(corpus, control= list(tokenize= BigramTokenizer))

sparse<- removeSparseTerms(dtm,.80)
dtm2<- as.matrix(sparse)
dtm2

Here is the output:

    Terms
Docs got rewards international transaction rewards card transaction fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0

How can I make the field names look like got_rewards instead of got rewards?

Answer 1

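I guess this is not really a tm-specific question. Anyway, you can either set collapse="_" in your tokenizer or modify the column names afterwards, like this: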

colnames(dtm2) <- gsub(" ", "_", colnames(dtm2), fixed = TRUE)
dtm2
    Terms
Docs got_rewards international_transaction rewards_card transaction_fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
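
The other option mentioned above, setting collapse="_" inside the tokenizer, could look roughly like the sketch below. It is not the asker's exact pipeline. It assumes a reasonably recent tm release and builds the corpus with VCorpus(VectorSource(...)) instead of Corpus(DataframeSource(df)), on the assumption that a VCorpus passes documents to a custom tokenizer. The sample sentences are shortened from the question, and the names sample_text, vc, UnderscoreBigramTokenizer and m are made up for illustration.

```r
library(tm)  # attaching tm also attaches NLP, which provides ngrams() and words()

# Shortened sample sentences taken from the question (illustrative only)
sample_text <- c("I got the rewards card for the no international transaction fee",
                 "I got the rewards card mainly for the flight perks",
                 "The customer service is outstanding!")

# Assumption: VCorpus is used so that the custom tokenizer is applied
vc <- VCorpus(VectorSource(sample_text))
vc <- tm_map(vc, content_transformer(tolower))
vc <- tm_map(vc, removePunctuation)
vc <- tm_map(vc, removeWords, stopwords("english"))
vc <- tm_map(vc, stripWhitespace)

# Same tokenizer as in the question, but each word pair is joined with "_"
UnderscoreBigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)

dtm_us <- DocumentTermMatrix(vc, control = list(tokenize = UnderscoreBigramTokenizer))
m <- as.matrix(dtm_us)

# Column names now already contain underscores, so no gsub() step is needed
print(colnames(m))
```

With this variant the terms are created with underscores from the start, so the names stay consistent if the matrix is rebuilt. The gsub() approach is simpler when the matrix already exists.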
