如何使用pandas确定每个唯一用户的优先级操作

时间:2016-03-25 14:13:11

标签: python csv pandas statistics

我有一个像

这样的文件数据
ID                                   address     used_at      active_seconds    pageviews
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2014-03-17 00:24:47   148   3
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2014-03-17 01:08:29   34    4
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2014-05-02 17:47:39   22    1
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2015-01-03 01:37:05   224   5
bcb0cc3d7f01dc6297f1331362a0fc09    e1.ru       2015-01-11 03:49:50   54    1
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2015-03-10 22:11:01   26    7
bcb0cc3d7f01dc6297f1331362a0fc09    avito.ru    2015-03-25 03:02:07   22    4
690ef4613fd977f9c29e1124b9d5814c    avito.ru    2014-02-05 09:25:56   6     3
690ef4613fd977f9c29e1124b9d5814c    avito.ru    2014-03-18 11:27:49   244   14

那里的所有档案file_with_data 我需要打印用户的优先级如何与2014年和2015年的网站相比发生了变化。也就是说,有必要依靠他们最初所在的网站以及当时的位置。

我想我需要一个循环

infile = pd.read_csv("avito_trend.csv", parse_dates=[2])
for id in infile['ID'].nunique():

但它不起作用。 如何绕过所有唯一身份证并获取有关访问的信息?

1 个答案:

答案 0 :(得分:1)

你可以这样做:

import pandas as pd

cols = ['ID', 'address', 'used_at']
df = pd.read_csv(r'avito_trend.csv', parse_dates=['used_at'], usecols=cols)

# sort DF by ID, Timestamp, address
df.sort_values(['ID','used_at','address'], inplace=True)

# adding helper columns: 'prev_address' and 'time_diff'
df['prev_address'] = df['address'].shift()
df['time_diff'] = df['used_at'] - df['used_at'].shift()

# exclude those where  `address` == `prev_address`
df = df[df['address'] != df['prev_address']]
# exclude those with `time_diff` > 10 minutes (please set desired value)
df = df[df['time_diff'] <= pd.Timedelta('10min')]

# group by (address, prev_address, df.used_at.dt.year) and count results
df[['ID','address','prev_address']] \
  .groupby(['address','prev_address', df.used_at.dt.year]) \
  .count() \
  .reset_index()

注意:请注意pd.Timedelta('10min') - 您可能需要调整时间差值

更新:year添加到groupby()

In [15]: df[['ID','address','prev_address']].groupby(['address','prev_address', df.used_at.dt.year]).count().reset_index()
Out[15]:
         address       prev_address  used_at    ID
0          am.ru            auto.ru     2014   103
1          am.ru            auto.ru     2015   135
2          am.ru           avito.ru     2014   133
3          am.ru           avito.ru     2015    31
4          am.ru      avtomarket.ru     2014    14
5          am.ru      avtomarket.ru     2015     6
6          am.ru  cars.mail.ru/sale     2014    17
7          am.ru  cars.mail.ru/sale     2015     8
8          am.ru            drom.ru     2014    65
9          am.ru            drom.ru     2015    29
10         am.ru              e1.ru     2014    33
11         am.ru              e1.ru     2015    17
12         am.ru        irr.ru/cars     2014    26
13         am.ru        irr.ru/cars     2015    20
14       auto.ru              am.ru     2014    86
15       auto.ru              am.ru     2015    77
16       auto.ru           avito.ru     2014  1316
17       auto.ru           avito.ru     2015  1052
18       auto.ru      avtomarket.ru     2014    39
19       auto.ru      avtomarket.ru     2015    32
20       auto.ru  cars.mail.ru/sale     2014    94
21       auto.ru  cars.mail.ru/sale     2015    31
22       auto.ru            drom.ru     2014   219
23       auto.ru            drom.ru     2015   205
24       auto.ru              e1.ru     2014   174
25       auto.ru              e1.ru     2015    84
26       auto.ru        irr.ru/cars     2014    89
27       auto.ru        irr.ru/cars     2015    41
28      avito.ru              am.ru     2014   109
29      avito.ru              am.ru     2015    49
..           ...                ...      ...   ...

<强> UPDATE2:

如果要将两列合并为一列:

new = df[['ID','address','prev_address']].groupby(['address','prev_address', df.used_at.dt.year]).count().reset_index()

new['visit'] = new['prev_address'] + ' -> ' + new['address']