Question

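I have been doing some Twitter mining for fun. I used Twitter's streaming API to record tweets before and after a football match. Now I want to make a ggplot2 chart that shows the frequency of tweets over the course of the match.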

In the original data frame there is one row per tweet, and the variable "created_at" contains dates like this: Sat Dec 13 13:04:34 +0000 2014

Then I changed the time format like this:

tweets$format <- as.POSIXct(tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "")

and got this: 2014-12-13 14:04:34 CET. Since I don't need the date, I thought I could get rid of it:

tweets$Uhrzeit <- sub(".* ", "", tweets$format)

With this, I am left with just the time: 14:04:34.

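My problem is that I want to analyse the tweet frequency at a resolution of tweets per minute. How can I aggregate the tweets per minute? As I said, every row is one tweet. I created a data frame with the time and a second variable that contains "1". Now I want to count (aggregate, sum) the second variable per minute. I tried to find a solution and read about the zoo and chron libraries, but that left me confused.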

I hope someone can help me.

Edit: reproducible data. The data frame is a subset of this: names(tweets)

 [1] "X"                         "text"                      "retweet_count"            
 [4] "favorited"                 "truncated"                 "id_str"                   
 [7] "in_reply_to_screen_name"   "source"                    "retweeted"                
[10] "created_at"                "in_reply_to_status_id_str" "in_reply_to_user_id_str"  
[13] "lang"                      "listed_count"              "verified"                 
[16] "location"                  "user_id_str"               "description"              
[19] "geo_enabled"               "user_created_at"           "statuses_count"           
[22] "followers_count"           "favourites_count"          "protected"                
[25] "user_url"                  "name"                      "time_zone"                
[28] "user_lang"                 "utc_offset"                "friends_count"            
[31] "screen_name"               "country_code"              "country"                  
[34] "place_type"                "full_name"                 "place_name"               
[37] "place_id"                  "place_lat"                 "place_lon"                
[40] "lat"                       "lon"                       "expanded_url"             
[43] "url"                       "timeformat"

I converted the "created_at" variable into the "timeformat" variable. A sample looks like this:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")

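I have just plotted the data with stat = "bin", which by default sets the bin width to 1/30 of the range of the data. One bin per minute would be better.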

ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")


Answer 1

Recreating your example dataset:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1), stringsAsFactors=FALSE)
colnames(tweets.df)<-c("time","freq")

First, your time column contains text strings, and you need POSIXct objects:

tweets.df$time <- as.POSIXct(tweets.df$time)
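
As a side note, the call above parses the strings in the session's local time zone, and as far as I know the trailing "CET" label is simply ignored. If you want the result to be independent of the machine it runs on, a minimal sketch is to pass the format and time zone explicitly. The zone name "Europe/Berlin" is my assumption for what "CET" means here, and the variable names are only for illustration:

```r
# Two sample timestamps in the same shape as tweets.df$time
times <- c("2014-12-13 14:04:34 CET", "2014-12-13 14:05:23 CET")

# Parse with an explicit format and time zone (assumed: Europe/Berlin);
# the text after the seconds is not part of the format
parsed <- as.POSIXct(times, format = "%Y-%m-%d %H:%M:%S", tz = "Europe/Berlin")

class(parsed)            # should include "POSIXct"
format(parsed, "%H:%M")  # hour and minute only
```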

Then, do the per-minute binning with the function cut.POSIXt:

by.mins <- cut.POSIXt(tweets.df$time,"mins")
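
To see what this binning produces, here is a small sketch on three made-up timestamps (not your data). It calls the generic cut(), which dispatches to cut.POSIXt for date-time input. The result is a factor with one level per minute between the first and last timestamp, including minutes in which nothing happened:

```r
# Three illustrative timestamps: two in 14:04, none in 14:05, one in 14:06
x <- as.POSIXct(c("2014-12-13 14:04:34",
                  "2014-12-13 14:04:45",
                  "2014-12-13 14:06:33"), tz = "UTC")

# Bin by minute; each value gets the level of the minute it falls into
bins <- cut(x, "min")

is.factor(bins)  # the result is a factor
nlevels(bins)    # one level per minute in the range, empty ones included
table(bins)      # counts per minute; the empty minute shows up as 0
```

That is why the empty minutes appear as zeros when you count: they exist as factor levels even though no row falls into them.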

Then you want to split the data frame by this factor and sum the freq column over each subset:

tweets.mins <- split(tweets.df, by.mins)
sapply(tweets.mins,function(x)sum(as.integer(x$freq)))
2014-12-13 14:04:00 2014-12-13 14:05:00 2014-12-13 14:06:00 2014-12-13 14:07:00 2014-12-13 14:08:00 
                  3                   3                   3                   0                   1 
2014-12-13 14:09:00 2014-12-13 14:10:00 2014-12-13 14:11:00 2014-12-13 14:12:00 2014-12-13 14:13:00 
                  2                   3                   2                   2                   0 
2014-12-13 14:14:00 2014-12-13 14:15:00 2014-12-13 14:16:00 2014-12-13 14:17:00 2014-12-13 14:18:00 
                 20                   2                   2                   4                   2 
2014-12-13 14:19:00 
                  1

In this case, since freq is always equal to 1, this is equivalent to using table(by.mins).
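
If the goal is the ggplot2 chart from the question, one possible way to get there is to turn the per-minute counts into a data frame and plot that. This is only a sketch on made-up timestamps; the names counts, minute and tweets are mine, and the plotting part is skipped if ggplot2 is not installed:

```r
# Illustrative timestamps (not the real tweets)
times <- as.POSIXct(c("2014-12-13 14:04:34",
                      "2014-12-13 14:04:45",
                      "2014-12-13 14:06:33"), tz = "UTC")

# One factor level per minute, then one count per level
by_min <- cut(times, "min")
counts <- data.frame(
  minute = as.POSIXct(levels(by_min), tz = "UTC"),  # bin start as a date-time
  tweets = as.integer(table(by_min))                # tweets in that minute
)

# Line chart of tweets per minute (only built when ggplot2 is available)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(counts, ggplot2::aes(x = minute, y = tweets)) +
    ggplot2::geom_line()
  # print(p) draws the chart
}
```

Converting the factor levels back to POSIXct keeps a continuous time axis, so a minute with no tweets is drawn as a zero rather than being left out.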
