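I have 180,000 rows in a CSV file, and the third column (Time) looks like 2016-10-20 03:43:11+00:00
(the times are in UTC). How can I plot a chart in Python that shows how many of these rows (tweets) occurred in each 5-minute interval over a 2-hour time range of the whole CSV file? For example, I am interested in knowing how many tweets occurred in each 5-minute interval.
Some sample rows from the CSV file are shown below: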
Candidate,ID,Time,Username,Tweet
Clinton,788948653016842240,2016-10-20 03:43:11+00:00,Tamayo_castle,Lorem ipsum dolor sit amet, consectetur adipiscing elit
Clinton,788948666501464064,2016-10-20 03:43:14+00:00,ThinkCenter1968,Maecenas congue, sem nec suscipit aliquam, lorem enim pl
Clinton,788948673594097664,2016-10-20 03:43:16+00:00,21stCenRevolt,Curabitur nec condimentum lorem. Aliquam a dolor porta
Both,788948662881751040,2016-10-20 03:43:13+00:00,mikeywan,Ut eu sagittis metus. Phasellus ut vulputate dui, nec malesuada
Both,788948675313696769,2016-10-20 03:43:16+00:00,erwoti,Fusce sit amet aliquet ipsum, quis placerat elit.
Clinton,788948671756955650,2016-10-20 03:43:15+00:00,isaac_urner,te nisi, vitae bibendum odio. Maecenas hen
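Basically, I am not sure how to link pd.date_range to tweets_df so that it can show the frequency of tweets at 5-minute intervals within the two hours (in a histogram format or any other representative plot).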
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
tweets_df = pd.read_csv('valid_tweets.csv')
print(tweets_df)
pd.date_range('10/20/2016 1:55', '10/20/2016 3:55',
freq='5 min', tz='UTC')
Answer 0 (score: 1)
With pandas >= 0.19, I would do it like this:
import pandas
import matplotlib.pyplot as plt
FIVEMIN = pandas.offsets.Minute(5)
fig, ax = plt.subplots(figsize=(6, 3.5))
ax = (
pandas.read_csv('data.csv', parse_dates=['Time'])
.resample(FIVEMIN, on='Time')['ID']
.count()
.plot.line(ax=ax)
)
plt.show()
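To see what the resample step produces before plotting, here is a minimal sketch. It swaps the CSV file for a small in-memory DataFrame built from the six sample rows in the question; that construction and the names `sample` and `counts` are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# IDs and timestamps copied from the six sample rows in the question.
sample = pd.DataFrame({
    "ID": [
        788948653016842240,
        788948666501464064,
        788948673594097664,
        788948662881751040,
        788948675313696769,
        788948671756955650,
    ],
    "Time": pd.to_datetime([
        "2016-10-20 03:43:11+00:00",
        "2016-10-20 03:43:14+00:00",
        "2016-10-20 03:43:16+00:00",
        "2016-10-20 03:43:13+00:00",
        "2016-10-20 03:43:16+00:00",
        "2016-10-20 03:43:15+00:00",
    ], utc=True),
})

# Group rows into 5-minute bins on the Time column and count the IDs per bin.
counts = sample.resample("5min", on="Time")["ID"].count()
print(counts)
```

All six sample tweets fall between 03:40 and 03:45, so the result is a single bin labeled 03:40 with a count of 6. On the full file you would get one row per 5-minute bin, which is what the plot call draws.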
If you are not using pandas 0.19 or later, you need to set the index explicitly:
ax = (
pandas.read_csv('data.csv', parse_dates=['Time'])
.set_index('Time')
.resample(FIVEMIN)['ID']
.count()
.plot.line(ax=ax)
)
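The question also asked how to tie pd.date_range to the counts so that only the two-hour window shows up. One possible approach, sketched below, is to reindex the resampled counts onto the date range, filling empty bins with 0. The timestamps marked as hypothetical are made up for illustration, and the names `df`, `window`, and `windowed` are assumptions, not something from the original post.

```python
import pandas as pd

times = pd.to_datetime([
    "2016-10-20 01:56:10+00:00",  # hypothetical
    "2016-10-20 01:58:00+00:00",  # hypothetical
    "2016-10-20 02:03:30+00:00",  # hypothetical
    "2016-10-20 03:43:11+00:00",  # from the question's sample rows
    "2016-10-20 03:43:14+00:00",  # from the question's sample rows
    "2016-10-20 04:10:00+00:00",  # hypothetical, outside the window
], utc=True)
df = pd.DataFrame({"ID": range(len(times)), "Time": times})

# Count tweets per 5-minute bin across the whole data set.
counts = df.resample("5min", on="Time")["ID"].count()

# Bin labels for the two-hour window from the question. The last label
# (03:55) is dropped so the bins cover 01:55 up to, but not past, 03:55.
window = pd.date_range("2016-10-20 01:55", "2016-10-20 03:55",
                       freq="5min", tz="UTC")[:-1]

# Keep only the bins inside the window; bins with no tweets become 0.
windowed = counts.reindex(window, fill_value=0)
print(windowed)
```

Calling `windowed.plot.bar()` afterwards should give a histogram-like chart with one bar per 5-minute interval. This relies on the resample bins and the date range both starting on multiples of 5 minutes, which holds for a window starting at 01:55.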