Keeping @ and # symbols and their case when going from a corpus to a document-term matrix

Time: 2018-06-27 07:01:36

Tags: r matrix document corpus term

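My corpus contains mentions prefixed with @ and hashtags prefixed with #. I want to keep both, along with their original case, so that I can build a wordcloud and ggplot charts of the mentions and hashtags. But after converting the corpus to a DTM, the terms come back without the @ and # symbols.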

The corpus is:

[1]  despite efforts efforts party vested interests allow happen filed two petitions one cj also pushed issue ecp pmln jui anp                                                                                                                                                        
[2]  believe party #ppp #mqm #anp support like one party aftar party                                                                                                                                                                                                                     
[3] @manzoorpashteen @bjsocialist                                                                                                                                                                                                      
[4] planted talkshows organized #occupiedmedia #ptm pashtuns like shahi syed anwar kakar @asadmunir invited malign @manzoorpashteen vein #ptm' popularity among pashtuns oppressed nations rise @ariftabassum writes                                                                  
[5] @bushragohar true unmarried woman eventually get insane like bushra gohar allah ap pe rehem kare madam bushra apki jaldi se shadi ho jae @asiab 

The desired corpus is:

[1]  despite efforts efforts party vested interests allow happen filed two petitions one cj also pushed issue ecp pmln jui anp                                                                                                                                                        
[2]  believe party #ppp #MQM #ANP support like one party aftar party                                                                                                                                                                                                                     
[3] @ManzoorPashteen @BJSocialist                                                                                                                                                                                                      
[4] planted talkshows organized #OccupiedMedia #PTM pashtuns like shahi syed anwar kakar @AsadMunir invited malign @ManzoorPashteen vein #PTM's popularity among pashtuns oppressed nations rise @ArifTabassum writes                                                                  
[5] @BushraGohar true unmarried woman eventually get insane like bushra gohar allah ap pe rehem kare madam bushra apki jaldi se shadi ho jae @Asiab

But the term_freq I get after converting the corpus to a DTM looks like this:

"manzoorpashteen"     "anp"    "mqm"   "asiab"   "jaldi"   "occupiedmedia"  
        2               1        1        1         1            1
"bushragohar"          "ppp"    "get" etc...
      1                  1        1

The desired term_freq is:

"@ManzoorPashteen"    "#ANP"    "#MQM"   "@Asiab"   "jaldi"   "#OccupiedMedia"  
        2               1          1        1         1            1
"@BushraGohar"        "#ppp"      "get" etc...
      1                  1          1 

How can I achieve this?
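
For reference, here is a minimal base R sketch of the counts I am after. It is not a DTM and does not use any text-mining package: it only splits on whitespace, so @ and # stay attached to their terms and the case is left alone. The three sample documents are copied from the desired corpus above. The variable names (docs, tokens, tags) and the step that strips a trailing possessive are illustrative choices, not part of my actual pipeline. If the pipeline is built on the tm package, the tolower and removePunctuation entries in the control list of DocumentTermMatrix() may be worth checking, but I have not verified that they are the cause here.

```r
# Sample documents copied from the desired corpus in the question.
docs <- c(
  "believe party #ppp #MQM #ANP support like one party aftar party",
  "@ManzoorPashteen @BJSocialist",
  "planted talkshows organized #OccupiedMedia #PTM pashtuns like shahi syed anwar kakar @AsadMunir invited malign @ManzoorPashteen vein #PTM's popularity among pashtuns oppressed nations rise @ArifTabassum writes"
)

# Split on whitespace only: no lowercasing, no punctuation removal,
# so "@" and "#" stay attached to their terms.
tokens <- unlist(strsplit(trimws(docs), "\\s+"))

# Illustrative assumption: fold a trailing possessive into the base term,
# so "#PTM's" is counted together with "#PTM".
tokens <- sub("'s?$", "", tokens)

# Term frequencies across all sample documents, most frequent first.
term_freq <- sort(table(tokens), decreasing = TRUE)

# Keep only mentions and hashtags, e.g. as input for a word cloud.
tags <- term_freq[grepl("^[@#]", names(term_freq))]
print(tags)
```

The names and counts in tags could then be passed to a plotting function as words and frequencies.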

0 answers:

No answers yet