I want to apply text mining to all the hashtags in my tweets. All of my data is in a CSV file, and I have separated the hashtags from the tweet text:
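library(stringr)  # str_extract_all()
library(tm)       # Corpus(), VectorSource(), tm_map(), content_transformer()

file_loc <- "/Users/Desktop/DATA/DATA.csv"
CSVFile <- read.csv(file_loc, header = TRUE, sep = ";", encoding = "UTF-8")
# Add a new column holding the hashtags of each tweet, joined by spaces
CSVFile$hashtags <- sapply(str_extract_all(CSVFile$Text, "#\\S+"), paste, collapse = " ")
# Delete all lines/tweets without hashtags
CSVFile <- CSVFile[!(CSVFile$hashtags == ""), ]
# Create a corpus of hashtags (one document per tweet)
myCorpus <- Corpus(VectorSource(CSVFile$hashtags))
# Replace non-ASCII bytes with spaces; convert the document text x, not the whole corpus
myCorpus <- tm_map(myCorpus,
                   content_transformer(function(x) iconv(x, to = "ASCII", sub = " ")))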
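So the problem is that I have several hashtags in the same tweet, and I do not want to split them apart before I start the data preprocessing.
Example: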
[984] #Springparty #rocknroll #fashion #thankyou #cocktails #photographerVmakeupartistÌã
[985] #holborn #redwine #wednesday #goodfriends #funÌã
[986] #datenight #indianfood #tamashabromley #foodporn #beers #cocktails #kamasutra
[987] #ThisIsLondon #TakeMeBackToLondon
[988] #gameofthrones
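
For anyone who wants to reproduce this without my CSV file, here is a minimal sketch of the extraction and cleanup steps in base R only. The `tweets` vector is made-up sample data standing in for `CSVFile$Text`, and `regmatches()`/`gregexpr()` are used here as a stand-in for `str_extract_all()`; it is an illustration under those assumptions, not the exact pipeline above.

```r
# Hypothetical sample data standing in for CSVFile$Text
tweets <- c(
  "Great night out #holborn #redwine #fun\u00cc\u00e3",
  "a tweet with no tags at all",
  "#gameofthrones"
)

# Extract every hashtag of each tweet; the result is a list with one
# character vector per tweet (empty when a tweet has no hashtag)
tags <- regmatches(tweets, gregexpr("#\\S+", tweets))

# Join the hashtags of each tweet into a single space-separated string,
# so that one tweet stays one document
hashtags <- vapply(tags, paste, character(1), collapse = " ", USE.NAMES = FALSE)

# Drop the tweets that had no hashtags
hashtags <- hashtags[hashtags != ""]

# Replace bytes that cannot be represented in ASCII with spaces
clean <- iconv(hashtags, from = "UTF-8", to = "ASCII", sub = " ")

print(clean)
```

Each element of `clean` still holds all the hashtags of one tweet, so passing it to `VectorSource()` would keep one document per tweet.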