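I have a data frame containing tweets, creation dates, tweet IDs, favorite counts, and retweet counts. I want to create a corpus in which each document carries its favorite and retweet counts as variables. I also want documents to be identified by their tweet ID rather than by an arbitrary ID such as doc 001.
I started with the data below; the rest of my code follows it.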
id
1: 737243856144629760
2: 737242308261842945
3: 737242189055594496
4: 737242018687164416
5: 737241411465170944
6: 737239685295181824
text
1: Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN!
2: "@NBCDFW: Trump rallies veterans at annual Rolling Thunder Gathering https://twitter.com/b08FcMlgkr https://twitter.com/RCDeLvHQqD"
3: "@FrankyLamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4: "@MariaErnandez3b: Trump Supports Rolling Thunder Rally #TRUMP STRONG https://twitter.com/pfVXQ8NdZu" So true, and remember the M.I.A.'s!
5: "@ScottWRasmussen: Donald Trump and Bikers Share Affection at Rolling Thunder Rally https://twitter.com/ZZl2sc29dn" A great day in D.C.!
6: "@TeaPartyNevada: #Trump2016 "Illegals are taken care of better than our veterans." https://twitter.com/KKIgM4rNma https://twitter.com/1cEZ8wG7Cy"
favorited favoriteCount replyToSN created truncated replyToSID replyToUID
1: FALSE 25944 NA 2016-05-30 11:26:47 FALSE NA NA
2: FALSE 9268 NA 2016-05-30 11:20:38 FALSE NA NA
3: FALSE 6739 NA 2016-05-30 11:20:09 FALSE NA NA
4: FALSE 15417 NA 2016-05-30 11:19:29 FALSE NA NA
5: FALSE 7192 NA 2016-05-30 11:17:04 FALSE NA NA
6: FALSE 9834 NA 2016-05-30 11:10:12 FALSE NA NA
statusSource screenName retweetCount
1: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 9455
2: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 2744
3: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 1604
4: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 4237
5: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 2148
6: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 3545
isRetweet retweeted longitude latitude
1: FALSE FALSE NA NA
2: FALSE FALSE NA NA
3: FALSE FALSE NA NA
4: FALSE FALSE NA NA
5: FALSE FALSE NA NA
6: FALSE FALSE NA NA
cleantxt
1: have a great memorial day and remember that we will soon make america great again!
2: "@nbcdfw: trump rallies veterans at annual rolling thunder gathering https://twitter.com/b08fcmlgkr https://twitter.com/rcdelvhqqd"
3: "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4: "@mariaernandez3b: trump supports rolling thunder rally #trump strong https://twitter.com/pfvxq8ndzu" so true, and remember the m.i.a.'s!
5: "@scottwrasmussen: donald trump and bikers share affection at rolling thunder rally https://twitter.com/zzl2sc29dn" a great day in d.c.!
6: "@teapartynevada: #trump2016 "illegals are taken care of better than our veterans." https://twitter.com/kkigm4rnma https://twitter.com/1cez8wg7cy"
I tried to convert it into a corpus with
myReader <- readTabular(mapping=list(content="cleantxt", id="id", created="created", retweet="retweetCount", fav="favoriteCount"))
trumptweetsenhanced <- VCorpus(DataframeSource(trumptweets.df), readerControl=list(reader=myReader))
However, when I convert the corpus back to a data frame, the variables have not been added:
> head(trumptweetsenhanced_dataframe.df)
docs text
1 doc 0001 great memori day rememb will soon make america great
2 doc 0002 nbcdfw trump ralli veteran annual roll thunder gather
3 doc 0003 frankylamouch mani donald roll thunder brigad will sign go war middl east
4 doc 0004 mariaernandezb trump support roll thunder ralli trump strong true rememb ms
5 doc 0005 scottwrasmussen donald trump biker share affect roll thunder ralli great day dc
6 doc 0006 teapartynevada trump illeg taken care better veteran
Answer 0 (score: 1)
You can use the tm::meta() function to add metadata to your tweets corpus. See library(tm); example(meta).
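This metadata annotation can happen at the corpus level: you may want to store "common" metadata there, such as the date on which the tweets in this corpus were collected, the search query string, API call details, and so on.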
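As a minimal sketch of corpus-level annotation, the snippet below attaches two values to a small corpus with meta(..., type = "corpus"). The two-tweet corpus, the tag names collected_on and query, and their values are assumptions made up for illustration; they are not part of the question's data.

```r
library(tm)

# Toy stand-in for the tweets (texts shortened from the question).
tweets <- c("have a great memorial day", "trump rallies veterans")
corp <- VCorpus(VectorSource(tweets))

# Corpus-level metadata: one value that describes the whole collection.
# The tag names and values here are hypothetical.
meta(corp, "collected_on", type = "corpus") <- "2016-05-30"
meta(corp, "query", type = "corpus") <- "from:realDonaldTrump"

# Read everything stored at the corpus level.
meta(corp, type = "corpus")
```

Values stored this way belong to the corpus as a whole, so they are not repeated per tweet.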
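The annotation can also happen at the document level (in this case, at the level of each tweet): you can store tweet attributes from the trumptweets.df data frame in the corpus, such as the retweet count, the favorite count, the language, and so on.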
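A document-level version might look like the sketch below, which copies two columns of the data frame into the corpus with the default (indexed) form of meta()<-. The two-row trumptweets.df here is a cut-down stand-in for the question's data, and it assumes the corpus documents are in the same order as the data frame rows.

```r
library(tm)

# Cut-down stand-in for the question's data frame (first two rows only).
trumptweets.df <- data.frame(
  id = c("737243856144629760", "737242308261842945"),
  cleantxt = c("have a great memorial day", "trump rallies veterans"),
  retweetCount = c(9455, 2744),
  favoriteCount = c(25944, 9268),
  stringsAsFactors = FALSE
)

corp <- VCorpus(VectorSource(trumptweets.df$cleantxt))

# Document-level ("indexed") metadata: one value per document,
# kept in a data frame alongside the corpus.
meta(corp, "retweetCount") <- trumptweets.df$retweetCount
meta(corp, "favoriteCount") <- trumptweets.df$favoriteCount

# One row per document, one column per tag.
meta(corp)
```

Because meta(corp) returns a data frame, it can be bound back to the texts with cbind() when the corpus is converted to a data frame again.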
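This takes careful, meticulous housekeeping. You would typically use a set of custom functions together with the *apply family of functions to call meta() for reading and writing (or use purrr::walk* or purrr::map*).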
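The housekeeping described above could be wrapped in two small helpers, sketched here. set_local_meta and get_local_meta are hypothetical names, not tm functions; the writer uses a plain loop and the reader uses lapply(). The example overwrites each document's built-in id tag with the tweet ID, which is one way to get rid of the default document IDs.

```r
library(tm)

# Cut-down stand-in for the question's data frame (first two rows only).
# IDs are kept as character strings; an 18-digit ID does not fit exactly
# in an R double.
trumptweets.df <- data.frame(
  id = c("737243856144629760", "737242308261842945"),
  cleantxt = c("have a great memorial day", "trump rallies veterans"),
  retweetCount = c(9455, 2744),
  stringsAsFactors = FALSE
)

corp <- VCorpus(VectorSource(trumptweets.df$cleantxt))

# Hypothetical helper: write one tag into every document's own metadata.
set_local_meta <- function(corpus, tag, values) {
  stopifnot(length(values) == length(corpus))
  for (i in seq_along(corpus)) {
    meta(corpus[[i]], tag) <- values[[i]]
  }
  corpus
}

# Hypothetical helper: read one tag back from every document as a vector.
get_local_meta <- function(corpus, tag) {
  unname(unlist(lapply(seq_along(corpus), function(i) meta(corpus[[i]], tag))))
}

corp <- set_local_meta(corp, "id", trumptweets.df$id)
corp <- set_local_meta(corp, "retweetCount", trumptweets.df$retweetCount)

get_local_meta(corp, "id")
```

Keeping the read and write paths in named helpers makes it easier to check that the metadata vectors and the corpus stay the same length and in the same order.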
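I wrote this off the top of my head. It has been a while since I last used meta(). There may well be an entirely different way to do it (using nested data frames, or using another text mining package).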
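One such different way, sketched here as an assumption rather than as the approach described above, relies on the DataframeSource convention in tm 0.7-0 and later: a doc_id column becomes the document ID, a text column becomes the content, and any remaining columns become document-level metadata. The data frame below is again a two-row stand-in, and the column renaming is the only preparation needed.

```r
library(tm)

# Cut-down stand-in for the question's data frame (first two rows only).
trumptweets.df <- data.frame(
  id = c("737243856144629760", "737242308261842945"),
  cleantxt = c("have a great memorial day", "trump rallies veterans"),
  retweetCount = c(9455, 2744),
  favoriteCount = c(25944, 9268),
  stringsAsFactors = FALSE
)

# Assumes tm >= 0.7-0: the first column must be doc_id, the second text.
src_df <- data.frame(
  doc_id = trumptweets.df$id,
  text = trumptweets.df$cleantxt,
  retweetCount = trumptweets.df$retweetCount,
  favoriteCount = trumptweets.df$favoriteCount,
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(src_df))

# Convert back to a data frame, keeping the tweet IDs and the counts.
back <- data.frame(
  id = unname(unlist(meta(corp, "id", type = "local"))),
  text = unname(unlist(lapply(seq_along(corp), function(i) {
    paste(content(corp[[i]]), collapse = " ")
  }))),
  meta(corp),
  stringsAsFactors = FALSE
)
back
```

With this layout no custom reader is needed, and the counts travel with the corpus through tm_map() transformations because they live in the corpus-side metadata data frame rather than in the text.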