I have the following data:
data id url size domain subdomain
13/Jun/2016:06:27:26 30055 https://api.weather.com/v1/geocode/55.740002/37.610001/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3929 weather.com api.weather.com
13/Jun/2016:06:27:26 30055 https://api.weather.com/v1/geocode/54.720001/20.469999/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3845 weather.com api.weather.com
13/Jun/2016:06:27:27 3845 https://api.weather.com/v1/geocode/54.970001/73.370003/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 30055 weather.com api.weather.com
13/Jun/2016:06:27:27 30055 https://api.weather.com/v1/geocode/59.919998/30.219999/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3914 weather.com api.weather.com
13/Jun/2016:06:27:28 30055 https://facebook.com 4005 facebook.com facebook.com
I need to group it into 5-minute intervals. Desired output:
data id url size domain subdomain
13/Jun/2016:06:27:26 30055 https://api.weather.com/v1/geocode/55.740002/37.610001/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3929 weather.com api.weather.com
13/Jun/2016:06:27:27 3845 https://api.weather.com/v1/geocode/54.970001/73.370003/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 30055 weather.com api.weather.com
13/Jun/2016:06:27:28 30055 https://facebook.com 4005 facebook.com facebook.com
I need to group by id and subdomain and build 5-minute intervals.
I tried using
print df.groupby([df['data'],pd.TimeGrouper(freq='Min')])
to group by minute first, but it returns TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
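Answer 0 (score: 3)
You need to parse the data column with pd.to_datetime(), setting the matching format, and use the result as the index. Then use .groupby() with a time grouper to bin the rows into 5Min intervals: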
df.index = pd.to_datetime(df.data, format='%d/%b/%Y:%H:%M:%S')
df.groupby(pd.TimeGrouper('5Min')).apply(lambda x: x.groupby(['id', 'subdomain']).first())
                                                           data  \
data                id    subdomain
2016-06-13 06:25:00 3845  api.weather.com  13/Jun/2016:06:27:27
                    30055 api.weather.com  13/Jun/2016:06:27:26
                          facebook.com     13/Jun/2016:06:27:28

                                                                                         url  \
data                id    subdomain
2016-06-13 06:25:00 3845  api.weather.com  https://api.weather.com/v1/geocode/54.970001/7...
                    30055 api.weather.com  https://api.weather.com/v1/geocode/55.740002/3...
                          facebook.com     https://facebook.com

                                            size        domain
data                id    subdomain
2016-06-13 06:25:00 3845  api.weather.com  30055   weather.com
                    30055 api.weather.com   3929   weather.com
                          facebook.com      4005  facebook.com
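
Note that pd.TimeGrouper has been removed from recent pandas releases in favour of pd.Grouper. The following is a minimal sketch of the same idea, assuming a recent pandas version. The sample frame is rebuilt by hand from the question's rows, and the url column is left out for brevity, which is a simplification for illustration only.

```python
import pandas as pd

# Sample rows copied from the question (url column omitted for brevity).
df = pd.DataFrame({
    "data": ["13/Jun/2016:06:27:26", "13/Jun/2016:06:27:26",
             "13/Jun/2016:06:27:27", "13/Jun/2016:06:27:27",
             "13/Jun/2016:06:27:28"],
    "id": [30055, 30055, 3845, 30055, 30055],
    "size": [3929, 3845, 30055, 3914, 4005],
    "domain": ["weather.com"] * 4 + ["facebook.com"],
    "subdomain": ["api.weather.com"] * 4 + ["facebook.com"],
})

# Parse the log-style timestamps into real datetimes.
df["data"] = pd.to_datetime(df["data"], format="%d/%b/%Y:%H:%M:%S")

# Bin by 5-minute window, then by id and subdomain, keeping the first row of each group.
result = (
    df.groupby([pd.Grouper(key="data", freq="5min"), "id", "subdomain"])
      .first()
)
print(result)
```

Passing key="data" lets pd.Grouper work on a column, so the index does not have to be replaced first.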
Answer 1 (score: 1)
Note that to convert to datetime you can pass the format:
df['data'] = pd.to_datetime(df['data'], format="%d/%b/%Y:%H:%M:%S")
Now you can use groupby:
In [11]: df1 = df.set_index("data")
In [12]: df1.groupby(pd.TimeGrouper("5Min")).sum()
Out[12]:
                         id   size
data
2016-06-13 06:25:00  124065  45748
It is better to write this as a resample:
In [13]: df1.resample("5Min").sum()
Out[13]:
                         id   size
data
2016-06-13 06:25:00  124065  45748
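
The steps above can be put together as a self-contained sketch. It assumes a recent pandas version and a hand-built frame holding only the question's numeric columns. The numeric columns are selected explicitly here, because newer pandas versions may not silently drop string columns when summing.

```python
import pandas as pd

# Timestamps, ids and sizes copied from the question's sample rows.
df = pd.DataFrame({
    "data": ["13/Jun/2016:06:27:26", "13/Jun/2016:06:27:26",
             "13/Jun/2016:06:27:27", "13/Jun/2016:06:27:27",
             "13/Jun/2016:06:27:28"],
    "id": [30055, 30055, 3845, 30055, 30055],
    "size": [3929, 3845, 30055, 3914, 4005],
})

df["data"] = pd.to_datetime(df["data"], format="%d/%b/%Y:%H:%M:%S")
df1 = df.set_index("data")

# Sum the numeric columns within each 5-minute bin.
totals = df1[["id", "size"]].resample("5min").sum()
print(totals)
```

Summing id is only done to mirror the output shown above; in practice the size column is probably the one worth aggregating.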
Answer 2 (score: 0)
You first need to call set_index. Check whether df.index is a DatetimeIndex. If it is not, that is the cause of the error.
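
A minimal sketch of that check, assuming a small hand-built frame with the question's timestamp format (the variable names before and after are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "data": ["13/Jun/2016:06:27:26", "13/Jun/2016:06:27:28"],
    "size": [3929, 4005],
})

# A freshly built frame has a RangeIndex, which time-based grouping rejects.
before = isinstance(df.index, pd.DatetimeIndex)

# Parse the strings and move them into the index.
df["data"] = pd.to_datetime(df["data"], format="%d/%b/%Y:%H:%M:%S")
df = df.set_index("data")

after = isinstance(df.index, pd.DatetimeIndex)
print(before, after)
```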