我有一个像
这样的文件数据ID address used_at active_seconds pageviews
bcb0cc3d7f01dc6297f1331362a0fc09 avito.ru 2014-03-17 00:24:47 148 3
bcb0cc3d7f01dc6297f1331362a0fc09 avito.ru 2014-03-17 01:08:29 34 4
bcb0cc3d7f01dc6297f1331362a0fc09 avito.ru 2014-05-02 17:47:39 22 1
bcb0cc3d7f01dc6297f1331362a0fc09 avito.ru 2015-01-03 01:37:05 224 5
bcb0cc3d7f01dc6297f1331362a0fc09 e1.ru 2015-01-11 03:49:50 54 1
bcb0cc3d7f01dc6297f1331362a0fc09 avito.ru 2015-03-10 22:11:01 26 7
bcb0cc3d7f01dc6297f1331362a0fc09 avito.ru 2015-03-25 03:02:07 22 4
690ef4613fd977f9c29e1124b9d5814c avito.ru 2014-02-05 09:25:56 6 3
690ef4613fd977f9c29e1124b9d5814c avito.ru 2014-03-18 11:27:49 244 14
那里的所有档案file_with_data 我需要打印用户的优先级如何与2014年和2015年的网站相比发生了变化。也就是说,有必要依靠他们最初所在的网站以及当时的位置。
我想我需要一个循环
infile = pd.read_csv("avito_trend.csv", parse_dates=[2])
for id in infile['ID'].nunique():
但它不起作用。 如何绕过所有唯一身份证并获取有关访问的信息?
答案 0 :(得分:1)
你可以这样做:
import pandas as pd
cols = ['ID', 'address', 'used_at']
df = pd.read_csv(r'avito_trend.csv', parse_dates=['used_at'], usecols=cols)
# sort DF by ID, Timestamp, address
df.sort_values(['ID','used_at','address'], inplace=True)
# adding helper columns: 'prev_address' and 'time_diff'
df['prev_address'] = df['address'].shift()
df['time_diff'] = df['used_at'] - df['used_at'].shift()
# exclude those where `address` == `prev_address`
df = df[df['address'] != df['prev_address']]
# exclude those with `time_diff` > 10 minutes (please set desired value)
df = df[df['time_diff'] <= pd.Timedelta('10min')]
# group by (address, prev_address, df.used_at.dt.year) and count results
df[['ID','address','prev_address']] \
.groupby(['address','prev_address', df.used_at.dt.year]) \
.count() \
.reset_index()
注意:请注意pd.Timedelta('10min')
- 您可能需要调整时间差值
更新:将year
添加到groupby()
In [15]: df[['ID','address','prev_address']].groupby(['address','prev_address', df.used_at.dt.year]).count().reset_index()
Out[15]:
address prev_address used_at ID
0 am.ru auto.ru 2014 103
1 am.ru auto.ru 2015 135
2 am.ru avito.ru 2014 133
3 am.ru avito.ru 2015 31
4 am.ru avtomarket.ru 2014 14
5 am.ru avtomarket.ru 2015 6
6 am.ru cars.mail.ru/sale 2014 17
7 am.ru cars.mail.ru/sale 2015 8
8 am.ru drom.ru 2014 65
9 am.ru drom.ru 2015 29
10 am.ru e1.ru 2014 33
11 am.ru e1.ru 2015 17
12 am.ru irr.ru/cars 2014 26
13 am.ru irr.ru/cars 2015 20
14 auto.ru am.ru 2014 86
15 auto.ru am.ru 2015 77
16 auto.ru avito.ru 2014 1316
17 auto.ru avito.ru 2015 1052
18 auto.ru avtomarket.ru 2014 39
19 auto.ru avtomarket.ru 2015 32
20 auto.ru cars.mail.ru/sale 2014 94
21 auto.ru cars.mail.ru/sale 2015 31
22 auto.ru drom.ru 2014 219
23 auto.ru drom.ru 2015 205
24 auto.ru e1.ru 2014 174
25 auto.ru e1.ru 2015 84
26 auto.ru irr.ru/cars 2014 89
27 auto.ru irr.ru/cars 2015 41
28 avito.ru am.ru 2014 109
29 avito.ru am.ru 2015 49
.. ... ... ... ...
<强> UPDATE2:强>
如果要将两列合并为一列:
new = df[['ID','address','prev_address']].groupby(['address','prev_address', df.used_at.dt.year]).count().reset_index()
new['visit'] = new['prev_address'] + ' -> ' + new['address']