From a data frame

Time: 2016-12-13 03:15:35

Tags: r twitter tm

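I have a data frame containing tweets, creation dates, tweet IDs, and favorite and retweet counts. I want to create a corpus in which each document carries its favorite and retweet counts as variables. I also want to identify documents by tweet ID rather than by generic IDs such as doc 001.

I start with the data below; the code I tried follows it.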

                   id
1: 737243856144629760
2: 737242308261842945
3: 737242189055594496
4: 737242018687164416
5: 737241411465170944
6: 737239685295181824
                                                                                                                                    text
1:                                                    Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN!
2:                 "@NBCDFW: Trump rallies veterans at annual Rolling Thunder Gathering https://twitter.com/b08FcMlgkr https://twitter.com/RCDeLvHQqD"
3:                "@FrankyLamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4:    "@MariaErnandez3b: Trump Supports Rolling Thunder Rally #TRUMP STRONG https://twitter.com/pfVXQ8NdZu" So true, and remember the M.I.A.'s!
5:     "@ScottWRasmussen: Donald Trump and Bikers Share Affection at Rolling Thunder Rally https://twitter.com/ZZl2sc29dn" A great day in D.C.!
6: "@TeaPartyNevada: #Trump2016 "Illegals are taken care of better than our veterans."  https://twitter.com/KKIgM4rNma https://twitter.com/1cEZ8wG7Cy"
   favorited favoriteCount replyToSN             created truncated replyToSID replyToUID
1:     FALSE         25944        NA 2016-05-30 11:26:47     FALSE         NA         NA
2:     FALSE          9268        NA 2016-05-30 11:20:38     FALSE         NA         NA
3:     FALSE          6739        NA 2016-05-30 11:20:09     FALSE         NA         NA
4:     FALSE         15417        NA 2016-05-30 11:19:29     FALSE         NA         NA
5:     FALSE          7192        NA 2016-05-30 11:17:04     FALSE         NA         NA
6:     FALSE          9834        NA 2016-05-30 11:10:12     FALSE         NA         NA
                                                                           statusSource      screenName retweetCount
1: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         9455
2: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         2744
3: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         1604
4: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         4237
5: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         2148
6: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump         3545
   isRetweet retweeted longitude latitude
1:     FALSE     FALSE        NA       NA
2:     FALSE     FALSE        NA       NA
3:     FALSE     FALSE        NA       NA
4:     FALSE     FALSE        NA       NA
5:     FALSE     FALSE        NA       NA
6:     FALSE     FALSE        NA       NA
                                                                                                                                cleantxt
1:                                                    have a great memorial day and remember that we will soon make america great again!
2:                 "@nbcdfw: trump rallies veterans at annual rolling thunder gathering https://twitter.com/b08fcmlgkr https://twitter.com/rcdelvhqqd"
3:                "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4:    "@mariaernandez3b: trump supports rolling thunder rally #trump strong https://twitter.com/pfvxq8ndzu" so true, and remember the m.i.a.'s!
5:     "@scottwrasmussen: donald trump and bikers share affection at rolling thunder rally https://twitter.com/zzl2sc29dn" a great day in d.c.!
6: "@teapartynevada: #trump2016 "illegals are taken care of better than our veterans."  https://twitter.com/kkigm4rnma https://twitter.com/1cez8wg7cy"

I tried converting it into a corpus with:

myReader <- readTabular(mapping=list(content="cleantxt", id="id", created="created", retweet="retweetCount", fav="favoriteCount"))
trumptweetsenhanced <- VCorpus(DataframeSource(trumptweets.df), readerControl=list(reader=myReader))

However, when I convert the corpus back to a data frame, the variables have not been added:

> head(trumptweetsenhanced_dataframe.df)
      docs                                                                            text
1 doc 0001                            great memori day rememb will soon make america great
2 doc 0002                           nbcdfw trump ralli veteran annual roll thunder gather
3 doc 0003       frankylamouch mani donald roll thunder brigad will sign go war middl east
4 doc 0004     mariaernandezb trump support roll thunder ralli trump strong true rememb ms
5 doc 0005 scottwrasmussen donald trump biker share affect roll thunder ralli great day dc
6 doc 0006                            teapartynevada trump illeg taken care better veteran

1 answer:

Answer 0 (score: 1)

You can use the tm::meta() function to add metadata to your tweet corpus. See library(tm); example(meta).

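This metadata annotation can happen at the corpus level: you may want to store "common" metadata there, such as the date the tweets in this corpus were collected, the search query string, API call details, and so on.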
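A minimal sketch of corpus-level annotation, assuming the tm package is installed. The texts, the tag names (collected_on, query) and their values are made up for illustration and do not come from the question's data.

```r
library(tm)

# Hypothetical stand-in texts; the real ones would come from trumptweets.df$cleantxt.
texts <- c("have a great memorial day", "trump rallies veterans")
corp <- VCorpus(VectorSource(texts))

# Corpus-level metadata: one value describing the whole collection.
# Tag names and values are illustrative assumptions.
meta(corp, tag = "collected_on", type = "corpus") <- "2016-05-30"
meta(corp, tag = "query", type = "corpus") <- "from:realDonaldTrump"

# Read a single tag back.
collected_on <- meta(corp, tag = "collected_on", type = "corpus")
```

Calling meta(corp, type = "corpus") without a tag should list everything stored at this level.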

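Annotation can also happen at the document level (in this case, at the tweet level): you can store tweet attributes from the trumptweets.df data frame in the corpus, such as retweet count, favorite count, language, and so on.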
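As a rough sketch of document-level annotation, the snippet below attaches one value per tweet as indexed metadata. The two-row data frame is a toy stand-in for trumptweets.df using the column names from the question, and the tag names (tweet_id, retweet, fav) are my own choice. The IDs are kept as character strings on purpose, since tweet IDs are larger than 2^53 and would lose digits if stored as doubles.

```r
library(tm)

# Toy stand-in for trumptweets.df (assumption: same column names as in the question).
tweets <- data.frame(
  id = c("737243856144629760", "737242308261842945"),
  cleantxt = c("have a great memorial day", "trump rallies veterans"),
  retweetCount = c(9455L, 2744L),
  favoriteCount = c(25944L, 9268L),
  stringsAsFactors = FALSE
)

corp <- VCorpus(VectorSource(tweets$cleantxt))

# Indexed (document-level) metadata: one value per document, in corpus order.
meta(corp, "tweet_id") <- tweets$id
meta(corp, "retweet") <- tweets$retweetCount
meta(corp, "fav") <- tweets$favoriteCount

# The indexed metadata comes back as a data frame with one row per document.
doc_meta <- meta(corp)
```

This relies on the corpus being built in the same row order as the data frame, so any filtering or reordering should happen before the metadata is attached.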

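This calls for careful, meticulous housekeeping. You typically use a set of custom functions together with the *apply family of functions to call meta() for both reading and writing (or use purrr::walk*, or purrr::map*).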
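One way this housekeeping might look, as a sketch rather than a definitive recipe: a loop writes each tweet's ID and retweet count into the document's own metadata, and vapply() reads them back into a data frame. The data frame is again a toy stand-in for trumptweets.df, and the "retweet" tag name is an assumption. Setting the document's "id" tag is what replaces the generic document identifier with the tweet ID.

```r
library(tm)

# Toy stand-in for trumptweets.df (assumption: same column names as in the question).
tweets <- data.frame(
  id = c("737243856144629760", "737242308261842945"),
  cleantxt = c("have a great memorial day", "trump rallies veterans"),
  retweetCount = c(9455L, 2744L),
  stringsAsFactors = FALSE
)

corp <- VCorpus(VectorSource(tweets$cleantxt))

# Write: store the tweet ID and retweet count on each document itself.
for (i in seq_along(corp)) {
  meta(corp[[i]], "id") <- tweets$id[i]
  meta(corp[[i]], "retweet") <- tweets$retweetCount[i]
}

# Read: pull text and metadata back out, one row per document.
idx <- seq_along(corp)
tweet_df <- data.frame(
  id = vapply(idx, function(i) meta(corp[[i]], "id"), character(1)),
  retweet = vapply(idx, function(i) meta(corp[[i]], "retweet"), integer(1)),
  text = vapply(idx, function(i) paste(content(corp[[i]]), collapse = " "), character(1)),
  stringsAsFactors = FALSE
)
```

Because the values live on each document, they should travel with the document through subsetting, which is the main reason to prefer this over keeping a separate lookup table.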

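I wrote this off the top of my head. It has been a while since I used meta(). There may be a completely different way to do it (using nested data frames, or using another text mining package).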