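Comparison cloud of text represented in a chart (wordcloud package)

Time: 2015-04-22 04:31:02

Tags: r tm word-cloud

I have a set of query-related content (from emails) that I preprocessed with the tm package. Wanting to represent it graphically, I came across this Twitter comparison cloud on text and tried to load and plot my own data. My corpus is a list of more than 500 documents. When converted to a DocumentTermMatrix, it yields more than 3k distinct words across all documents in the list.

Data (the corpus, b):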

[[538]]
<<PlainTextDocument (metadata: 7)>>
  kumar m santhosh   monday  october   pm  rizal herwin g s venkatesh global business reporting cc tjhin minarti arsojo nindyo subje

[[539]]
<<PlainTextDocument (metadata: 7)>>
  harjono bambang  wednesday  october   pm  global business reporting cc saptadi firman subject re commercial asia booking point limits  

[[540]]
<<PlainTextDocument (metadata: 7)>>
  kumar m santhosh   tuesday  october     global business reporting ramesh sandeep talanki   g s venkatesh cc challagundla ram bhupal chowdary subject fw please approve  qlikview gpa access please action  access request regards santhosh   monteleone elif  monday  october     g s venkatesh kumar m santhosh  cc singh sarvjeet saini subject fw please approve  qlikview gpa access hi guys can  please get access    finiasi jieni  monday  october     monteleone elif subject fw please approve  qlikview gpa access hi elif hope   well    able  approve  request  access   pacific sites please regards jieni   finiasi jieni  monday  september     deo ravinesh subject please approve  qlikview gpa access hello can  please review  attached form  click line manager approval  approve 

[[541]]
<<PlainTextDocument (metadata: 7)>>
roe clarification

[[542]]
<<PlainTextDocument (metadata: 7)>>
  heo jae hyun  wednesday  october     icis helpdesk subject case id  fw questions  gpa hi team  response   inquiry   jae hyun heo  director  financial institutions group nd floor kyobo building  chongro  ka chongro ku seoul korea office      mobile      email jaehyunheoanzcom australia  new zealand banking group ltd     heo jae hyun  monday  september     icis helpdesk subject questions  gpa hi team please see  screen copy  gpa  fig korea   like  ask  following questions   terms  revrwa  calculation   key performance ratio  revrwa mtd  gpa  however   calculated  ratio based upon  information  gpa  shows total revenue mtd  rwa mtd   mn  mn    question      gpa  calculated revrwa ytd  gpa  however   calculated  ratio based upon  informaiton  gpa  shows  total revenue ytd  rwa ytd   mn  mn    question      gpa  calculated revrwa fyx  gpa    calculated  ratio based upon  information  gpa  shows  total revenue fyx  rwa fyx   mn  mn      almost     gpa  can  find revrwa ratio   client level  jae hyun heo  director  financial institutions group nd floor kyobo building  chongro  ka chongro ku seoul korea office      mobile      email jaehyunheoanzcom australia  new zealand banking group ltd 

data$Output:

Report/Data
Access
Access
Access
Report/Data

tdm <- TermDocumentMatrix(b)
matrix <- as.matrix(tdm)
colnames(term.matrix) =c(data$Output)
# each document in the corpus must be tagged with its corresponding output
# here the outputs ("Access", "Report/Data") are represented as 1 and 2

comparison.cloud(term.matrix, max.words=2000, random.order=FALSE)
commonality.cloud(term.matrix, random.order=FALSE)
# Error in strwidth(words[i], cex = size[i], ...) : invalid 'cex' value

The output of comparison.cloud is below.

[image]

How can I replace the numbers 1 and 2 with the original labels and represent the text in the chart effectively?

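1 Answer:

Answer 0 (score: 1)

Using the data sample you provided, I created a small object, df (as the dput output shows, a two-row, one-column character matrix with one row per category).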

> dput(df)
structure(c("kumar m santhosh   monday  october   pm  rizal herwin g s venkatesh global business reporting cc tjhin minarti arsojo nindyo subje heo jae hyun  wednesday  october     icis helpdesk subject case id  fw questions  gpa hi team  response   inquiry   jae hyun heo  director  financial institutions group nd floor kyobo building  chongro  ka chongro ku seoul korea office      mobile      email jaehyunheoanzcom australia  new zealand banking group ltd     heo jae hyun  monday  september     icis helpdesk subject questions  gpa hi team please see  screen copy  gpa  fig korea   like  ask  following questions", 
"harjono bambang  wednesday  october   pm  global business reporting cc saptadi firman subject re commercial asia booking point limits    kumar m santhosh   tuesday  october     global business reporting ramesh sandeep talanki   g s venkatesh cc challagundla ram bhupal chowdary subject fw please approve  qlikview gpa access please action  access request regards santhosh   monteleone elif  monday  october     g s venkatesh kumar m santhosh  cc singh sarvjeet saini subject fw please approve  qlikview gpa access hi guys can  please get access    finiasi jieni  monday  october     monteleone elif subject fw please approve  qlikview gpa access hi elif hope   well    able  approve  request  access   pacific sites please regards jieni   finiasi jieni  monday  september     deo ravinesh subject please approve  qlikview gpa access hello can  please review  attached form  click line manager approval  approve  roe clarification"
), .Dim = c(2L, 1L), .Dimnames = list(c("rpt", "acc"), NULL))
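
For a larger corpus, an object with one string per category need not be assembled by hand. The following is a minimal sketch in base R, assuming the raw texts and their labels are available as two parallel character vectors. The names `texts` and `labels` and their values are hypothetical and shortened for illustration; they are not objects from the question.

```r
# Hypothetical per-document texts and labels (invented, shortened values).
texts <- c("global business reporting",
           "please approve access",
           "access request regards",
           "roe clarification",
           "questions gpa revenue")
labels <- c("Report/Data", "Access", "Access", "Access", "Report/Data")

# Paste together all texts that share a label: one string per category.
# The result is named by label, with the labels in sorted order.
combined <- tapply(texts, labels, paste, collapse = " ")

print(combined)
```

An object like `combined` could then presumably be passed to VectorSource() in place of df; note that its elements come out in sorted label order, which matters when naming the matrix columns later.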

Then I followed your code, with a few changes.

library("tm")        # provides Corpus, VectorSource and TermDocumentMatrix
library("wordcloud") # added for completeness

corpus <- Corpus(VectorSource(df)) # added this call

tdm <- TermDocumentMatrix(corpus)
term.matrix <- as.matrix(tdm)  # changed to term.matrix
# columns follow document order in df: "rpt" first, then "acc"
colnames(term.matrix) <- c("report", "access")

comparison.cloud(term.matrix, max.words=2000, random.order=FALSE) # several other arguments are available


Continuing,

commonality.cloud(term.matrix, random.order=FALSE)
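
With 500+ documents, an alternative to concatenating the texts up front is to sum the term counts per category after the matrix has been built. The following is a minimal sketch in base R, assuming a term-by-document count matrix and one label per document in the same order. The toy matrix and its counts are invented for illustration; they stand in for as.matrix(tdm) and data$Output.

```r
# Toy term-by-document count matrix standing in for as.matrix(tdm);
# the terms and counts are invented for illustration.
term.matrix <- matrix(
  c(2, 0, 1, 0, 3,
    0, 4, 2, 1, 0,
    1, 1, 0, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("access", "approve", "revenue"), paste0("doc", 1:5))
)

# One label per document, in document order (the role of data$Output).
labels <- c("Report/Data", "Access", "Access", "Access", "Report/Data")

# rowsum() adds up rows by group, so transpose to put documents in rows,
# aggregate by label, then transpose back to terms x categories.
by.category <- t(rowsum(t(term.matrix), group = labels))

print(by.category)
```

The result has one column per category, named by its label, so the clouds would be titled with the labels rather than with 1 and 2 when `by.category` is passed to comparison.cloud or commonality.cloud.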
