Question

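I have a lot of timestamped customer transaction data, which I read in chunks from flat files. How can I efficiently find the latest (last) transaction made by each customer?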

Sample data:

id    login    transaction_id    transaction_date
 1    asdf                  1   13-10-2015 15:30:45
 2    fghd                  2   13-10-2015 16:30:45
 4    rteu                  3   13-10-2015 17:30:45
 2    fghd                  4   13-10-2015 18:30:45
 3    rtey                  5   13-10-2015 19:30:45
 5    lkiu                  6   13-10-2015 20:30:45

From this sample data, I want to get the following dataframe. The data may be split across files.

login    transaction_count    last_transaction_id
asdf               1              1
fghd               2              4
rtey               1              5
rteu               1              3
lkiu               1              6

Answer 1

Here is some rather long code that gets you what you want:

# First, keep ready a groupby of the 'name' column
# (the question's data calls this column 'login')
names_by = df.groupby('name')

# next, count how many times each name occurs
data = names_by.agg({'transaction_id': 'count'}).reset_index()

# at this point, the dataframe looks like this:
   name  transaction_id
0  asdf               6
1  fghd               2
2  lkiu               1
3  rteu               1
4  rtey               1

# next we want the latest transaction id: on a per-name basis,
# set 'transaction_date' as the index and take the last transaction_id
# (this assumes the rows are already in chronological order)

transid = lambda index: names_by.get_group(index).set_index('transaction_date')['transaction_id'].iloc[-1]


# the last step is to add the result to our target dataframe

data['last_transaction_id'] = data['name'].apply(transid)

# and, when we print 'data' we get:
      name  transaction_id  last_transaction_id
   0  asdf               6                   12
   1  fghd               2                    4
   2  lkiu               1                    6
   3  rteu               1                    3
   4  rtey               1                    5
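
For reference, the steps in this answer can be run end to end on the sample rows from the question. This is only a minimal sketch under a few assumptions: the column is named login (as in the question, rather than name), the rows are loaded from an inline string instead of a flat file, and the dates are parsed with pd.to_datetime so that "latest" is decided by the date rather than by row order.

```python
import io
import pandas as pd

# Sample rows copied from the question, inlined as CSV for illustration.
raw = """id,login,transaction_id,transaction_date
1,asdf,1,13-10-2015 15:30:45
2,fghd,2,13-10-2015 16:30:45
4,rteu,3,13-10-2015 17:30:45
2,fghd,4,13-10-2015 18:30:45
3,rtey,5,13-10-2015 19:30:45
5,lkiu,6,13-10-2015 20:30:45
"""
df = pd.read_csv(io.StringIO(raw))
df['transaction_date'] = pd.to_datetime(
    df['transaction_date'], format='%d-%m-%Y %H:%M:%S')

by_login = df.groupby('login')

# Step 1: number of transactions per login.
data = (by_login['transaction_id'].count()
        .rename('transaction_count')
        .reset_index())

# Step 2: per login, the transaction_id on the row with the latest date.
def last_transaction_id(login):
    group = by_login.get_group(login)
    return group.sort_values('transaction_date')['transaction_id'].iloc[-1]

data['last_transaction_id'] = data['login'].apply(last_transaction_id)
print(data)
```

Calling get_group once per login is easy to follow but loops in Python, so on large data a single groupby aggregation is likely to be faster.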

Answer 2

If you want to take the latest transaction by transaction date:

In [43]: res = df.sort_values('transaction_date', ascending=False).groupby('login').agg({'transaction_id': ['size', 'first']})
In [44]: res.columns = ['transaction_count', 'last_transaction_id']
In [46]: res
Out[46]: 
       transaction_count  last_transaction_id
login                                        
asdf                   1                    1
fghd                   2                    4
lkiu                   1                    6
rteu                   1                    3
rtey                   1                    5
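
One caveat with sorting by transaction_date: if the column was read as plain strings in day-month-year format, the sort is lexicographic and can misorder rows from different days or months. The sketch below assumes that situation and parses the column first. The two rows are hypothetical (not from the question) and were chosen only to show the difference.

```python
import pandas as pd

# Hypothetical rows: same login, transactions in different months.
df = pd.DataFrame({
    'login': ['asdf', 'asdf'],
    'transaction_id': [1, 2],
    'transaction_date': ['13-10-2015 15:30:45', '02-11-2015 09:00:00'],
})

# As strings, '02-11-2015 ...' sorts before '13-10-2015 ...',
# so convert to real datetimes before sorting.
df['transaction_date'] = pd.to_datetime(
    df['transaction_date'], format='%d-%m-%Y %H:%M:%S')

res = (df.sort_values('transaction_date', ascending=False)
         .groupby('login')
         .agg(transaction_count=('transaction_id', 'size'),
              last_transaction_id=('transaction_id', 'first')))
print(res)
```

With the dates parsed, the November transaction (id 2) is treated as the latest one for asdf.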

Alternatively, if you just want the maximum ID for each group, it is even easier:

In [47]: res = df.groupby('login').agg({'transaction_id': ['size', 'max']})
In [48]: res.columns = ['transaction_count', 'last_transaction_id']
In [49]: res
Out[49]: 
       transaction_count  last_transaction_id
login                                        
asdf                   1                    1
fghd                   2                    4
lkiu                   1                    6
rteu                   1                    3
rtey                   1                    5
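
Since the question mentions that the data is read in chunks and may be split across files, one possible approach is to aggregate each chunk separately and then combine the partial results. The following is a rough sketch, not a tuned implementation: it assumes comma-separated input, uses an inline string in place of real files, and uses a chunksize of 2 purely for illustration. The intermediate column last_date is a helper name introduced here.

```python
import io
import pandas as pd

# Sample rows from the question, inlined as CSV; in practice this would be
# one or more flat files passed to pd.read_csv.
raw = """id,login,transaction_id,transaction_date
1,asdf,1,13-10-2015 15:30:45
2,fghd,2,13-10-2015 16:30:45
4,rteu,3,13-10-2015 17:30:45
2,fghd,4,13-10-2015 18:30:45
3,rtey,5,13-10-2015 19:30:45
5,lkiu,6,13-10-2015 20:30:45
"""

partials = []
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    chunk['transaction_date'] = pd.to_datetime(
        chunk['transaction_date'], format='%d-%m-%Y %H:%M:%S')
    # Per chunk: count, plus the id and date of the latest transaction.
    partial = (chunk.sort_values('transaction_date')
                    .groupby('login')
                    .agg(transaction_count=('transaction_id', 'size'),
                         last_transaction_id=('transaction_id', 'last'),
                         last_date=('transaction_date', 'last')))
    partials.append(partial.reset_index())

# Combine: sum the counts, keep the id belonging to the overall latest date.
combined = pd.concat(partials, ignore_index=True)
res = (combined.sort_values('last_date')
               .groupby('login')
               .agg(transaction_count=('transaction_count', 'sum'),
                    last_transaction_id=('last_transaction_id', 'last')))
print(res)
```

Only one small summary row per login and chunk is kept in memory, so the full data never has to be loaded at once. For several files, the same loop could be repeated for each file path before the final combine step.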

pandas: find the latest value of an index in time series data
