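I have a large amount of timestamped customer transaction data, which I read in chunks from flat files. How can I efficiently find the latest (last) transaction made by each customer?
Sample data: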
id login transaction_id transaction_date
1 asdf 1 13-10-2015 15:30:45
2 fghd 2 13-10-2015 16:30:45
4 rteu 3 13-10-2015 17:30:45
2 fghd 4 13-10-2015 18:30:45
3 rtey 5 13-10-2015 19:30:45
5 lkiu 6 13-10-2015 20:30:45
From this sample data I would like to get the following dataframe. Note that the data can be split across files.
login transaction_count last_transaction_id
asdf 1 1
fghd 2 4
rtey 1 5
rteu 1 3
lkiu 1 6
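Answer 0 (score: 1)
Here is some rather long code that should get you what you want. Note that it uses a column named 'name' where the question's data has 'login', and the printed output appears to come from a slightly different dataset than the question's sample: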
# First, keep a groupby of the 'name' column ready
names_by = df.groupby('name')
# next, count how many transactions each name has
# ('count' tallies non-null transaction ids per group)
data = names_by.agg({'transaction_id': 'count'}).reset_index()
# at this point, 'data' looks like this:
name transaction_id
0 asdf 6
1 fghd 2
2 lkiu 1
3 rteu 1
4 rtey 1
# next we want the latest transaction id per name:
# for each group we set 'transaction_date' as the index
# and take the last transaction_id by position
# (this relies on the rows already being in chronological order)
transid = lambda name: names_by.get_group(name).set_index('transaction_date')['transaction_id'].iloc[-1]
# the last step is to add the result to our target dataframe
data['last_transaction_id'] = data['name'].apply(transid)
# and when we print 'data' we get:
name transaction_id last_transaction_id
0 asdf 6 12
1 fghd 2 4
2 lkiu 1 6
3 rteu 1 3
4 rtey 1 5
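The snippet above does not run on its own, so here is a minimal self-contained sketch of the same idea applied to the question's sample data. It uses the question's 'login' column instead of 'name', and it assumes (these are not in the original) that the flat file is comma-separated and that the dates are day-first; the helper name last_transaction_id is made up for illustration.

```python
import io
import pandas as pd

# Sample rows from the question; comma separation is an assumption.
RAW = """id,login,transaction_id,transaction_date
1,asdf,1,13-10-2015 15:30:45
2,fghd,2,13-10-2015 16:30:45
4,rteu,3,13-10-2015 17:30:45
2,fghd,4,13-10-2015 18:30:45
3,rtey,5,13-10-2015 19:30:45
5,lkiu,6,13-10-2015 20:30:45
"""

df = pd.read_csv(io.StringIO(RAW))
# Parse the day-first timestamps so that sorting is chronological.
df["transaction_date"] = pd.to_datetime(
    df["transaction_date"], format="%d-%m-%Y %H:%M:%S"
)

by_login = df.groupby("login")
# One row per login with the number of transactions.
data = by_login["transaction_id"].count().reset_index(name="transaction_count")

def last_transaction_id(login):
    # Hypothetical helper: sort one login's rows by date and take the last id.
    group = by_login.get_group(login).sort_values("transaction_date")
    return group["transaction_id"].iloc[-1]

data["last_transaction_id"] = data["login"].apply(last_transaction_id)
print(data)
```

Sorting inside the helper removes the dependence on the file's row order, at the cost of one sort per login.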
Answer 1 (score: 1)
If you want to take the latest transaction by transaction date:
In [43]: res = df.sort_values('transaction_date', ascending=False).groupby('login').agg({'transaction_id': ['size', 'first']})
In [44]: res.columns = ['transaction_count', 'last_transaction_id']
In [46]: res
Out[46]:
       transaction_count  last_transaction_id
login
asdf                   1                    1
fghd                   2                    4
lkiu                   1                    6
rteu                   1                    3
rtey                   1                    5
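One caveat with sorting by date: if 'transaction_date' is still text in the day-first format shown in the question, sorting compares the strings rather than the dates. The sketch below illustrates this with two made-up rows (they are not from the question) and assumes the format is always "%d-%m-%Y %H:%M:%S".

```python
import pandas as pd

# Hypothetical rows: one login with transactions in two different months.
df = pd.DataFrame({
    "login": ["asdf", "asdf"],
    "transaction_id": [1, 2],
    "transaction_date": ["13-10-2015 15:30:45", "01-11-2015 09:00:00"],
})

# Sorting the raw strings puts "13-..." after "01-...",
# so the October row is treated as the newest one.
as_text = (
    df.sort_values("transaction_date", ascending=False)
      .groupby("login")["transaction_id"]
      .first()
)

# Parsing with an explicit day-first format restores chronological order.
df["transaction_date"] = pd.to_datetime(
    df["transaction_date"], format="%d-%m-%Y %H:%M:%S"
)
as_dates = (
    df.sort_values("transaction_date", ascending=False)
      .groupby("login")["transaction_id"]
      .first()
)

print(as_text["asdf"], as_dates["asdf"])
```

With the text column the "latest" id comes out as 1, while with parsed dates it is 2, so converting the column before sorting is the safer option.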
Or, if you just want the maximum ID for each group, it is even simpler:
In [47]: res = df.groupby('login').agg({'transaction_id': ['size', 'max']})
In [48]: res.columns = ['transaction_count', 'last_transaction_id']
In [49]: res
Out[49]:
       transaction_count  last_transaction_id
login
asdf                   1                    1
fghd                   2                    4
lkiu                   1                    6
rteu                   1                    3
rtey                   1                    5
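Since the question says the data is read in chunks and can be split across files, the same aggregation can probably be done per chunk and then merged. The following is only a sketch under some assumptions: the files are comma-separated, dates are day-first, and the function name summarize and the helper column last_date are made up for illustration. Here the chunks are simulated with read_csv(..., chunksize=2) on the sample data.

```python
import io
import pandas as pd

RAW = """id,login,transaction_id,transaction_date
1,asdf,1,13-10-2015 15:30:45
2,fghd,2,13-10-2015 16:30:45
4,rteu,3,13-10-2015 17:30:45
2,fghd,4,13-10-2015 18:30:45
3,rtey,5,13-10-2015 19:30:45
5,lkiu,6,13-10-2015 20:30:45
"""

def summarize(chunks):
    """Aggregate each chunk separately, then merge the partial results."""
    parts = []
    for chunk in chunks:
        chunk = chunk.assign(transaction_date=pd.to_datetime(
            chunk["transaction_date"], format="%d-%m-%Y %H:%M:%S"))
        # Per-chunk count, latest id, and the date of that latest id.
        part = (
            chunk.sort_values("transaction_date")
                 .groupby("login")
                 .agg(transaction_count=("transaction_id", "count"),
                      last_transaction_id=("transaction_id", "last"),
                      last_date=("transaction_date", "max"))
        )
        parts.append(part)
    # Merge: add up the counts, keep the id that belongs to the latest date.
    combined = pd.concat(parts).reset_index().sort_values("last_date")
    return combined.groupby("login").agg(
        transaction_count=("transaction_count", "sum"),
        last_transaction_id=("last_transaction_id", "last"),
    )

# Simulate chunked reading; with real files, chain the chunk iterators instead.
result = summarize(pd.read_csv(io.StringIO(RAW), chunksize=2))
print(result)
```

Only one small summary row per login and chunk is kept in memory, so the full data never has to be loaded at once.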