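I am working with a dataset.
I need to group it and count how many requests fall within each time period, so that I can easily plot time against the number of requests.
Example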
**timestamp - number of request**
21-06-2016 09:00:00 - 2
21-06-2016 10:00:00 - 1
21-06-2016 11:00:00 - 5
How can I compute this?
Thanks
P.S. I tried using data['timestamp'].value_counts()
but got an error:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
dateparse = lambda dates: pd.datetime.strptime(dates, '%d-%m-%Y %H:%M:%S')
data = pd.read_csv('/home/amfirnas/Desktop/localhost_access_log.2016-06-21.csv',
parse_dates=['timestamp'], index_col='timestamp',date_parser=dateparse)
print data.head(25)
# print data['time'].value_counts()
# print data.groupby(['time']).groups.keys()
ts = data['timestamp'].value_counts()
# plt.plot(ts)
# plt.show()
Answer 0: (score: 0)
Read the file:
df = pd.read_csv('/home/local/sayali/Downloads/dataset-server_logs.csv')
[In]:df
host timestamp status byte
0 192.168.102.100 21-06-2016 09:54:44 200 17811
1 192.168.102.100 21-06-2016 09:54:44 200 21630
2 192.168.100.160 21-06-2016 10:08:08 404 1098
3 192.168.100.160 21-06-2016 11:20:44 200 17811
4 192.168.100.160 21-06-2016 11:20:44 200 21630
5 192.168.102.100 21-06-2016 11:54:44 200 17811
6 192.168.102.100 21-06-2016 11:54:44 200 21630
7 192.168.102.100 21-06-2016 11:54:44 200 21630
ts = pd.DataFrame(df['timestamp'].value_counts())
ts
Out[15]:
timestamp
2016-06-21 11:54:44 3
2016-06-21 09:54:44 2
2016-06-21 11:20:44 2
2016-06-21 10:08:08 1
# Convert index to datetime format using pd.to_datetime()
# (dayfirst=True because the raw strings are in day-month-year order)
ts.index = pd.to_datetime(ts.index, dayfirst=True)
# PLOT
plt.title('Number of Requests based on timestamp')
plt.xlabel('Timestamp')
plt.ylabel('Total number of Requests')
#Change xticks orientation to vertical
plt.xticks(rotation='vertical')
plt.plot(ts)
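A note on the error in the question: it most likely comes from passing index_col='timestamp' to read_csv. That moves the timestamp out of the columns and into the index, so data['timestamp'] raises a KeyError. The sketch below reproduces this under stated assumptions: the inline CSV text is a hypothetical stand-in for the real log file, with the same column names as the table above.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real log file
csv_text = """host,timestamp,status,byte
192.168.102.100,21-06-2016 09:54:44,200,17811
192.168.102.100,21-06-2016 09:54:44,200,21630
192.168.100.160,21-06-2016 10:08:08,404,1098
"""

# Same idea as the question: 'timestamp' becomes the index
data = pd.read_csv(io.StringIO(csv_text), index_col='timestamp')

# 'timestamp' is no longer among the columns
has_column = 'timestamp' in data.columns

# Selecting it as a column therefore fails with a KeyError
try:
    data['timestamp'].value_counts()
    raised = False
except KeyError:
    raised = True

# Counting on the index itself works
counts = data.index.value_counts()
```

So either drop index_col (as done above) or call value_counts() on data.index.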
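Answer 1: (score: 0)
If you want the count per hour rather than per exact timestamp (which is what value_counts() gives you), group the rows by hour and then count them. For this to work, make sure your timestamp column is a pandas datetime: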
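# The explicit format avoids day/month ambiguity in strings like 21-06-2016
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%Y %H:%M:%S')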
df.groupby(pd.Grouper(key='timestamp', freq="1H")).count()
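To make the hourly grouping concrete, here is a minimal self-contained sketch. It assumes the eight timestamps from the sample table in the other answer, typed in by hand instead of read from the CSV. It uses size() rather than count() to get a single series of row counts, and freq='60min' as an equivalent spelling of one hour.

```python
import pandas as pd

# Timestamps copied from the sample table (assumption: no other rows)
df = pd.DataFrame({
    'timestamp': [
        '21-06-2016 09:54:44',
        '21-06-2016 09:54:44',
        '21-06-2016 10:08:08',
        '21-06-2016 11:20:44',
        '21-06-2016 11:20:44',
        '21-06-2016 11:54:44',
        '21-06-2016 11:54:44',
        '21-06-2016 11:54:44',
    ]
})

# Parse the day-month-year strings explicitly
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%Y %H:%M:%S')

# One bin per hour; size() counts the rows in each bin
hourly = df.groupby(pd.Grouper(key='timestamp', freq='60min')).size()

print(hourly)
```

For this sample the bins start at 09:00, 10:00 and 11:00 with 2, 1 and 5 requests, which is the layout asked for in the question. The resulting series can be passed to plt.plot(hourly) in the same way as ts above.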