How do I combine rows into a single row in a TermDocumentMatrix?

Asked: 2016-10-03 06:24:43

Tags: r text-mining term-document-matrix

I am trying to combine several rows of a TermDocumentMatrix into a single row.

(I know that each row represents one word.)

e.g. cabin, staff -> crews

Because 'cabin', 'staff', and 'crews' mean the same thing, I am trying to merge the rows for 'cabin' and 'staff' into the single row for 'crews'.

However, it does not work at all.

R says: argument "weighting" is missing, with no default

The code I ran is below:

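# Packages used below: httr (GET), rvest (read_html, html_nodes, html_text),
# tm (Corpus, tm_map, TermDocumentMatrix), slam (rollup)
library(httr)
library(rvest)
library(tm)
library(slam)

r=GET('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
base_url=('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
h<-read_html(base_url)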

all.reviews = c()

for (i in 1:10){
  print(i)
  url = paste(base_url, 'page/', i, '/', sep="")
  r = GET(url)
  h = read_html(r)
  comment_area = html_nodes(h, '.tc_mobile')
  comments = html_nodes(comment_area, '.text_content')
  reviews = html_text(comments)
  all.reviews = c(all.reviews, reviews)
}

cps <- Corpus(VectorSource(all.reviews))
cps <- tm_map(cps, content_transformer(tolower)) 
cps <- tm_map(cps, content_transformer(stripWhitespace))
cps <- tm_map(cps, content_transformer(removePunctuation))
cps <- tm_map(cps, content_transformer(removeNumbers))
cps <- tm_map(cps, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(cps, control=list(
wordLengths=c(3, 20),
weighting=weightTf))

rows.cabin = grep('cabin|staff', row.names(tdm))
rows.cabin
# [1]  235 1594
count.cabin = as.array(rollup(tdm[rows.cabin,], 1)) 
count.cabin
#Docs
#Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26   27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
#1 0 1 1 0 0 2 2 0 0  1  1  0  4  0  1  0  1  0  2  1  0  0  1  3  1  4  2  0  3  0  1  1  4  0  0  2  1  0  0  2  1  0  2  1  3  3  1
 #Docs
#Terms 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
#1  0  1  0  1  2  3  2  2  1  1  0  2  0  0  0  0  0  2  0  1  0  0  4  0  2  2  1  3  1  1  1  1  0  0  0  5  3  0  2  1  0  1  0  0
 #Docs
#Terms 92 93 94 95 96 97 98 99 100
#1  1  5  2  1  0  0  0  1   0
row.crews = grep('crews', row.names(tdm))
row.crews
#[1] 408
tdm[row.crews,] = count.cabin
rows.cabin = setdiff(rows.cabin, row.crews) # ok
tdm = tdm[-rows.cabin,] # ok

dtm = as.DocumentTermMatrix(tdm)
# Error in .TermDocumentMatrix(t(x), weighting) :
# argument "weighting" is missing, with no default

Perhaps combining rows in a TermDocumentMatrix is not the right approach.

Please correct this code or suggest a better way to solve this problem.

Thanks in advance.

1 Answer:

Answer 0 (score: 0)

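Hmm, I wonder why you are sticking to your approach, which obviously does not work, instead of just copying, pasting, and adjusting* the suggestion from here?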

library(tm)
library(httr)
library(rvest)
library(slam)
# [...] # your code
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
#        Docs
# Terms   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#   cabin 0 0 0 0 0 1 1 0 0  1  0  0  3  0  0
#   crew  0 0 0 1 1 1 1 0 2  1  0  1  0  2  0
#   crews 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0
#   staff 0 1 1 0 0 1 1 0 0  0  1  0  1  0  1

dict <- list(
  "CREW" = grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE, value = TRUE)
)
terms <- Terms(tdm)
for (x in seq_along(dict)) 
  terms[terms %in% dict[[x]] ] <- names(dict)[x]
tdm <- slam::rollup(tdm, 1, terms, sum)
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
#       Docs
# Terms  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#   CREW 0 1 1 1 1 3 3 0 2  2  1  1  4  2  1

*I only adjusted the lines in the dict definition...
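
To try the relabel-and-rollup idea without scraping anything, here is a minimal sketch on a made-up three-document corpus. The documents, the `synonyms` vector, and the `merged` name are invented for illustration; the calls are the same `Terms()` and `slam::rollup()` used above. The result is converted with `as.matrix()` for inspection, so the sketch does not rely on which class `rollup()` returns.

```r
library(tm)
library(slam)

# Toy corpus (hypothetical data, not the scraped reviews)
docs <- c("cabin crew was friendly",
          "staff helped the cabin crew",
          "seat was fine")
cps <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(cps)

# Relabel every synonym with one shared name, leaving other terms unchanged
synonyms <- c("cabin", "staff", "crew")
terms <- Terms(tdm)
terms[terms %in% synonyms] <- "CREW"

# Sum all rows that now share a label (MARGIN = 1 means rows, i.e. terms)
merged <- slam::rollup(tdm, 1, terms, sum)

# Dense view of the merged row: per-document counts for CREW
as.matrix(merged)["CREW", ]
```

In this toy corpus the CREW row should hold 2, 3, and 0 for the three documents, because the first document contains two of the synonyms, the second contains all three, and the third contains none.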