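I have been doing some fun Twitter mining. I used Twitter's streaming API and recorded tweets before and after a football match. Now I would like to make a ggplot2 chart that shows the frequency of tweets over the course of the match.
In the original data frame there is one row per tweet, and the variable "created_at" contains dates like this: Sat Dec 13 13:04:34 +0000 2014
I then converted the time format like this: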
tweets$format <- as.POSIXct(tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "")
which gave me 2014-12-13 14:04:34 CET. Since I don't need the date, I thought I could get rid of it:
tweets$Uhrzeit <- sub(".* ", "", tweets$format)
With this, I am left with only the time: 14:04:34.
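A minimal sketch of the conversion described above, using the single example timestamp from the question. Two things here are assumptions added for illustration and are not in the original post: LC_TIME is temporarily set to "C" so that %a and %b match English abbreviations, and tz = "UTC" is used instead of the local time zone so the result does not depend on the machine. It also uses format() rather than sub() to extract the time of day.

```r
# Parsing %a / %b depends on the LC_TIME locale; "C" gives English names.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

created_at <- "Sat Dec 13 13:04:34 +0000 2014"

# Note the spaces between the conversion specifiers: they must mirror the input.
parsed <- as.POSIXct(created_at,
                     format = "%a %b %d %H:%M:%S %z %Y",
                     tz = "UTC")  # assumption: UTC chosen for reproducibility

Sys.setlocale("LC_TIME", old_locale)  # restore the previous locale

# Extract only the time of day as a character string.
time_only <- format(parsed, "%H:%M:%S")
print(time_only)
```

Keeping the full POSIXct value (rather than only the time string) is usually more convenient for later aggregation and plotting.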
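My problem is that I want to analyse the tweet frequency at a resolution of tweets per minute. How can I aggregate the tweets per minute? As I said before, every row is one tweet. I created a data frame with the time and a second variable that contains a "1". Now I would like to aggregate (sum) this second variable for every minute. I tried to find a solution by reading about the zoo library and the chron library, but that only confused me.
I hope someone can help me.
Edit: reproducible data. The data frame is a subset of this:
names(tweets)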
[1] "X" "text" "retweet_count"
[4] "favorited" "truncated" "id_str"
[7] "in_reply_to_screen_name" "source" "retweeted"
[10] "created_at" "in_reply_to_status_id_str" "in_reply_to_user_id_str"
[13] "lang" "listed_count" "verified"
[16] "location" "user_id_str" "description"
[19] "geo_enabled" "user_created_at" "statuses_count"
[22] "followers_count" "favourites_count" "protected"
[25] "user_url" "name" "time_zone"
[28] "user_lang" "utc_offset" "friends_count"
[31] "screen_name" "country_code" "country"
[34] "place_type" "full_name" "place_name"
[37] "place_id" "place_lat" "place_lon"
[40] "lat" "lon" "expanded_url"
[43] "url" "timeformat"
I converted the "created_at" variable into the "timeformat" variable. A sample looks like this:
tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")
I have just plotted the data with stat = "bin", which by default sets the bin width to 1/30 of the data range. One bin per minute would be better.
ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")
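One possible way to get one-minute bins directly in the plot, sketched under the assumption that ggplot2 is installed and that timeformat is a POSIXct column. For POSIXct data the x scale is measured in seconds, so binwidth = 60 corresponds to one minute; boundary = 0 is added here so that bins start on whole minutes. The small data frame below is a hypothetical stand-in for the real tweets data.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: one row per tweet.
tweets <- data.frame(
  timeformat = as.POSIXct(c("2014-12-13 14:04:34", "2014-12-13 14:04:37",
                            "2014-12-13 14:04:45", "2014-12-13 14:05:23",
                            "2014-12-13 14:06:33", "2014-12-13 14:06:59"),
                          tz = "UTC")
)

# binwidth is in seconds for POSIXct; boundary = 0 aligns bins to full minutes.
p <- ggplot(tweets, aes(x = timeformat)) +
  geom_line(stat = "bin", binwidth = 60, boundary = 0)

print(p)
```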
Answer 0 (score: 2)
Amending your example data set (adding stringsAsFactors=FALSE):
tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1), stringsAsFactors=FALSE)
colnames(tweets.df)<-c("time","freq")
First, your time column contains text strings, and you need POSIXct objects:
tweets.df$time <- as.POSIXct(tweets.df$time)
Then, use the function cut.POSIXt:
by.mins <- cut.POSIXt(tweets.df$time,"mins")
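To illustrate what this step produces, here is a small sketch on three made-up timestamps (the values and the tz = "UTC" setting are assumptions for the example). It calls the generic cut(), which dispatches to cut.POSIXt for date-time input. The result is a factor with one level per minute, including minutes in which nothing happened.

```r
# Three hypothetical timestamps: two in minute 14:04, none in 14:05, one in 14:06.
times <- as.POSIXct(c("2014-12-13 14:04:34",
                      "2014-12-13 14:04:45",
                      "2014-12-13 14:06:33"), tz = "UTC")

# cut() dispatches to cut.POSIXt; "mins" creates one-minute intervals.
by_min <- cut(times, "mins")

# The factor has a level for every minute in the range, even the empty one.
print(nlevels(by_min))
print(as.vector(table(by_min)))  # counts per minute: 2 0 1
```

Because empty minutes are kept as levels, later summaries report a zero for them instead of silently skipping them.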
Then, you want to split the data frame by this factor and sum the freq column over each subset:
tweets.mins <- split(tweets.df, by.mins)
sapply(tweets.mins,function(x)sum(as.integer(x$freq)))
2014-12-13 14:04:00 2014-12-13 14:05:00 2014-12-13 14:06:00 2014-12-13 14:07:00 2014-12-13 14:08:00
                  3                   3                   3                   0                   1
2014-12-13 14:09:00 2014-12-13 14:10:00 2014-12-13 14:11:00 2014-12-13 14:12:00 2014-12-13 14:13:00
                  2                   3                   2                   2                   0
2014-12-13 14:14:00 2014-12-13 14:15:00 2014-12-13 14:16:00 2014-12-13 14:17:00 2014-12-13 14:18:00
                 20                   2                   2                   4                   2
2014-12-13 14:19:00
                  1
In this case, since freq is always equal to 1, this is equivalent to using table(by.mins).
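The equivalence can be checked on a small example, and the table can then be turned into a data frame that is convenient for plotting. This is only a sketch: the timestamps, the column names minute and tweets, and tz = "UTC" are assumptions made for illustration.

```r
# Hypothetical data: one row per tweet, freq is always 1.
df <- data.frame(
  time = as.POSIXct(c("2014-12-13 14:04:34", "2014-12-13 14:04:45",
                      "2014-12-13 14:06:33", "2014-12-13 14:06:38",
                      "2014-12-13 14:06:59"), tz = "UTC"),
  freq = 1L
)
by_min <- cut(df$time, "mins")

# Approach 1: split by minute and sum freq in each subset.
sums <- sapply(split(df, by_min), function(x) sum(x$freq))

# Approach 2: simply count rows per minute.
counts <- table(minute = by_min)

# Both give the same numbers per minute.
print(all(as.vector(counts) == unname(sums)))

# Turn the counts into a data frame with a real date-time column for plotting.
per_min <- as.data.frame(counts, responseName = "tweets")
per_min$minute <- as.POSIXct(as.character(per_min$minute), tz = "UTC")
print(per_min)

# With ggplot2 installed, this could be plotted as, for example:
# ggplot2::ggplot(per_min, ggplot2::aes(minute, tweets)) + ggplot2::geom_line()
```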