R: classifying comments with no matching keyword as NA

Time: 2017-06-21 14:02:00

Tags: r keyword text-mining text-classification

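I am new to R and am working on classifying text by keyword. I have user comments stored in a CSV file. I import only the first 50 comments and classify them according to the keywords they contain.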

This is done with the following code:

library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
library(tm)
app_r <- read.csv('my-tracks-reviews.csv', stringsAsFactors = FALSE)
description <- app_r$comment_text[1:50]

#Text Cleaning
text <- Corpus(VectorSource(description))
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeNumbers)
text <- tm_map(text, removePunctuation)
text <- tm_map(text, removeWords,stopwords('english'))
#text <- tm_map(text, removeWords,c('crap','excellent')) #Additional words
text <- tm_map(text, stripWhitespace)
print(text)
dataframe <- data.frame(text=sapply(text,identity), stringsAsFactors=FALSE)
str(dataframe)
# Keyword-to-category lookup table
df1 <- data.frame(Keyword = c("gps", "signal", "battery", "map", "track", "tracks", "app", "update", "updates"),
                  Category = c("S", "S", "P", "M", "M", "M", "O", "U", "U"),
                  stringsAsFactors = FALSE)
df1

#Creating vocabulary
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$Keyword)))

# 2. create the dtm
dtm <- create_dtm(itoken(as.character(dataframe$text)),vocabulary)

# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$text <- dataframe$text
str(dtm_df)
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "text", variable.name = "Keyword")

# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "Keyword")
df_final 
write.csv(df_final, file = "d:/R/testml/keywords.csv", row.names = FALSE)

The output is written to a CSV file in the following format:

 Keyword                text                             Category
    app     really briliant app intuitive informative       O
    app         great app glad used tracks perfectly        O
    app              interesting app                        O

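My problem is that, of the 50 imported comments, only 33 classified comments end up in the output CSV file; the other 17 are dropped. I need those to be kept as not-applicable (NA) entries and stored in the CSV file along with the others, with "NA" as their category.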
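For illustration only (this is not part of the original post), here is a minimal base-R sketch of one way comments without any keyword could be kept with an NA category. The idea is to do the keyword match as an inner join, then left-join the result back onto the full list of comments with `merge(..., all.x = TRUE)`. The data frames `comments` and `keywords`, and the `id` column, are hypothetical stand-ins for the cleaned text and for `df1`; the sketch tokenizes with `strsplit` rather than text2vec, so it is an assumption about the intent, not a drop-in patch for the code above.

```r
# Hypothetical stand-ins for the cleaned comments and the keyword table (df1).
comments <- data.frame(
  text = c("really briliant app intuitive informative",
           "great app glad used tracks perfectly",
           "excellent thank",
           "like like"),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  Keyword  = c("gps", "signal", "battery", "map", "track", "tracks",
               "app", "update", "updates"),
  Category = c("S", "S", "P", "M", "M", "M", "O", "U", "U"),
  stringsAsFactors = FALSE
)

# Give every comment an id so identical comments stay distinct.
comments$id <- seq_len(nrow(comments))

# Long format: one row per (comment id, token) pair.
tokens <- strsplit(comments$text, " ", fixed = TRUE)
long <- data.frame(
  id      = rep(comments$id, lengths(tokens)),
  Keyword = unlist(tokens),
  stringsAsFactors = FALSE
)

# Inner join: keeps only tokens that are known keywords.
matched <- unique(merge(long, keywords, by = "Keyword"))

# Left join from the full comment list: comments with no keyword are kept,
# with NA in both Keyword and Category.
result <- merge(comments, matched, by = "id", all.x = TRUE)
result <- result[order(result$id), c("Keyword", "text", "Category")]
```

Writing `result` with `write.csv(result, file = ..., row.names = FALSE)` would then store the unmatched comments too, since `write.csv` writes missing values as the string NA by default.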

Part of the data in 'my-tracks-reviews' is shown below:

    **comment_text**    **rating**
 "really briliant app    it's intuitive and informative giving all the information you could need and seemingly very accurate."  5
"will not connect to gps    app does not connect to gps no matter how long i have it on. i have gps set on high accuracy and other settings appear to be set as these should be. the app is useless to me  if it can't track my workout."   1
"wish this would interest more with google now to provide weekly or monthly summaries."   5
"useless    does not talk to gps on the phone. 20 minute run no data."  1
"great app  so glad i used this it tracks perfectly."   4
"excellent  thank you." 5
"update i wish this app had quick sharing where i could view on any device without drive and ability to view on google maps instead of google earthadv version of maps cuz i do most of my home stuff on my tablet and exercise with my phone it just a hassle.. overall i like the app it needs a update to keep up with rest of the google products and become more competitive with other products." 4
"nice but needs work    used this app a few times now and every time it takes an age to locate my position via gps. changing to other apps i have that rely on gps they locate me straight away. return to this app and it is still searching. last time it took over five minutes to locate my position this took some of the enjoyment out of using this app. will continue to use this to see if it improves. 3/5."  3
"dr anand venugopal too good."  4
"very interesting app."   5
"brilliant  good solid app."    5
"like   like."  4

Here comment_text is one column and rating is another, each holding the corresponding data.

0 answers:

No answers yet