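My corpus contains mentions prefixed with @ and hashtags prefixed with #. I want to keep both, including their original capitalization, so that I can build wordcloud and ggplot charts of the mentions and hashtags. However, after converting the corpus to a DTM, the terms come back without the @ and # symbols.
The corpus is: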
[1] despite efforts efforts party vested interests allow happen filed two petitions one cj also pushed issue ecp pmln jui anp
[2] believe party #ppp #mqm #anp support like one party aftar party
[3] @manzoorpashteen @bjsocialist
[4] planted talkshows organized #occupiedmedia #ptm pashtuns like shahi syed anwar kakar @asadmunir invited malign @manzoorpashteen vein #ptm' popularity among pashtuns oppressed nations rise @ariftabassum writes
[5] @bushragohar true unmarried woman eventually get insane like bushra gohar allah ap pe rehem kare madam bushra apki jaldi se shadi ho jae @asiab
The desired corpus is:
[1] despite efforts efforts party vested interests allow happen filed two petitions one cj also pushed issue ecp pmln jui anp
[2] believe party #ppp #MQM #ANP support like one party aftar party
[3] @ManzoorPashteen @BJSocialist
[4] planted talkshows organized #OccupiedMedia #PTM pashtuns like shahi syed anwar kakar @AsadMunir invited malign @ManzoorPashteen vein #PTM's popularity among pashtuns oppressed nations rise @ArifTabassum writes
[5] @BushraGohar true unmarried woman eventually get insane like bushra gohar allah ap pe rehem kare madam bushra apki jaldi se shadi ho jae @Asiab
However, the term_freq I get after converting the corpus to a DTM looks like this:
"manzoorpashteen" "anp" "mqm" "asiab" "jaldi" "occupiedmedia"
2 1 1 1 1 1
"bushragohar" "ppp" "get" etc...
1 1 1
The desired term_freq is:
"@ManzoorPashteen" "#ANP" "#MQM" "@Asiab" "jaldi" "#OccupiedMedia"
2 1 1 1 1 1
"@BushraGohar" "#ppp" "get" etc...
1 1 1
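To make the target concrete, here is a minimal sketch of the counting I am after. It is written in Python purely for illustration and is not my tm pipeline. It assumes that splitting on whitespace only, with no lowercasing and no punctuation removal, is the behaviour wanted; `docs` and `tokenize` are hypothetical names.

```python
from collections import Counter

# The five documents from the desired corpus above, with case and symbols intact.
docs = [
    "despite efforts efforts party vested interests allow happen filed two "
    "petitions one cj also pushed issue ecp pmln jui anp",
    "believe party #ppp #MQM #ANP support like one party aftar party",
    "@ManzoorPashteen @BJSocialist",
    "planted talkshows organized #OccupiedMedia #PTM pashtuns like shahi syed "
    "anwar kakar @AsadMunir invited malign @ManzoorPashteen vein #PTM's "
    "popularity among pashtuns oppressed nations rise @ArifTabassum writes",
    "@BushraGohar true unmarried woman eventually get insane like bushra gohar "
    "allah ap pe rehem kare madam bushra apki jaldi se shadi ho jae @Asiab",
]


def tokenize(text):
    """Split on whitespace only, so @, # and letter case are left untouched."""
    return text.split()


# Term frequencies over the whole corpus.
term_freq = Counter(token for doc in docs for token in tokenize(doc))

# Mentions and hashtags can then be separated by their leading symbol.
mentions = {t: n for t, n in term_freq.items() if t.startswith("@")}
hashtags = {t: n for t, n in term_freq.items() if t.startswith("#")}
```

With this kind of tokenization, `#ANP` and the plain word `anp` stay separate terms, which is what I want for the plots. In tm, the `termFreq` control list has `tokenize` and `tolower` entries, which may be the place to look, but I have not managed to get this result with them.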
How can I achieve this?