I am writing a count function over subsets of a Pandas DataFrame, and I want to export a dict / spreadsheet that contains only the groupby keys and the counted result.
In [1]: df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                           ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                           ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                           ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                           ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                           ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                           ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                           ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                           ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
                          columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
Out[1]:
   side sender   id     type  quantity         receive_time
0   Buy      A  123      NEW       500  20190101-09:00:00am
1   Buy      A  124      CXL       500  20190101-09:00:01am
2   Buy      A  125      NEW       500  20190101-09:00:03am
3   Buy      A  126  REPLACE       300  20190101-09:00:10am
4   Buy      B  210      NEW      1000  20190101-09:10:00am
5   Buy      B  345      NEW       200  20190101-09:00:00am
6  Sell      C  412      NEW       100  20190101-09:00:00am
7  Sell      C  413      NEW       200  20190101-09:01:00am
8  Sell      C  414      CXL        50  20190101-09:02:00am
The count function is as follows (mydf is passed in as a subset of the dataframe):
def ordercount(mydf):
    num = 0.0
    if mydf.type == 'NEW':
        num = num + mydf.qty
    elif mydf.type == 'REPLACE':
        num = mydf.qty
    elif mydf.type == 'CXL':
        num = num - mydf.qty
    else:
        pass
    orderdict = dict.fromkeys([mydf.side, mydf.sender, mydf.id], num)
    return orderdict
After reading the data from a csv, I group it by a few criteria and also sort it by time:
df = pd.read_csv('xxxxxxxxx.csv', sep='|', header=0, engine='python', names=col_names)
sorted_df = df.groupby(['side', 'sender', 'id']).apply(lambda _df: _df.sort_values(by=['time']))
Then I call the function defined above on the sorted data:
print(sorted_df.agg(ordercount))
However, a ValueError keeps popping up, and the function fails when it is applied across the rows.
My way of counting the data may not be efficient, but it was the most straightforward approach I could think of to match on the order type and count the quantity accordingly. I would like the program to output a table that shows only side, sender, id and the counted quantity. Is there any way to do this? Thanks.
Expected output:
  side sender  total_order_num  trade_date
0  Buy      A              300    20190101
1  Buy      B             1200    20190101
2 Sell      C              250    20190101
Answer (score: 0):
I believe your function is not easy to apply in one pass, because you perform a different operation depending on the row. If you only had + and - as operations it would work, but at some point a REPLACE comes in and the other operations continue after it. So a loop may be simpler, or you could spend some time writing a nicer function for the task (a rough sketch of that idea is below).
Anyway, here is what I have. All I really did was change your ordercount so that it acts directly on the subset, and then you simply group by. You can either sort by time before grouping, or do the sorting inside the ordercount function. Hope this helps.
import pandas as pd
df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
df['receive_time'] = pd.to_datetime(df['receive_time'])
df['receive_date'] = df['receive_time'].dt.date # you do not need the time stamps
def ordercount(mydf):
    mydf_ = mydf.sort_values('receive_time')[['type', 'quantity']].copy()
    num = 0
    for val in mydf_.values:
        type_, quantity = val
        # val is going to be a list like ['NEW', 500]. All I am doing above is
        # unpacking it into two variables. You can find many resources on
        # unpacking iterables.
        if type_ == 'NEW':
            num += quantity
        elif type_ == 'REPLACE':
            num = quantity
        elif type_ == 'CXL':
            num -= quantity
        else:
            pass
    return num
mydf = df.groupby(['side', 'sender', 'receive_date']).apply(ordercount).reset_index()
Output:
|    | side | sender | receive_date        | 0    |
|----|------|--------|---------------------|------|
| 0  | Buy  | A      | 2019-01-01 00:00:00 | 300  |
| 1  | Buy  | B      | 2019-01-01 00:00:00 | 1200 |
| 2  | Sell | C      | 2019-01-01 00:00:00 | 250  |
You can easily rename the column "0" to whatever you like (see the sketch below). I am still not sure how trade_date is defined, though. Does your data only contain a single date? What happens if you have multiple dates? Does that matter to you? ...
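For example, renaming could look like this (a minimal sketch; 'total_order_num' and 'trade_date' are just the names from your expected output, and mapping receive_date onto trade_date is my guess at your intent):

result = mydf.rename(columns={0: 'total_order_num', 'receive_date': 'trade_date'})
result = result[['side', 'sender', 'total_order_num', 'trade_date']]
print(result)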
Edit: if you try it with the dataframe below (and re-run the same grouping, as sketched after it), you can see that the groups by date work as expected.
df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am'],
['Buy', 'A', 123, 'NEW', 500, '20190102-09:00:00am'],
['Buy', 'A', 124, 'CXL', 500, '20190102-09:00:01am'],
['Buy', 'A', 125, 'NEW', 500, '20190102-09:00:03am'],
['Buy', 'A', 126, 'REPLACE', 300, '20190102-09:00:10am'],
['Buy', 'B', 210, 'NEW', 1000, '20190102-09:10:00am'],
['Buy', 'B', 345, 'NEW', 200, '20190102-09:00:00am'],
['Sell', 'C', 412, 'NEW', 100, '20190102-09:00:00am'],
['Sell', 'C', 413, 'NEW', 200, '20190102-09:01:00am'],
['Sell', 'C', 414, 'CXL', 50, '20190102-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
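The preprocessing and grouping would then be re-run exactly as above (this repetition is mine; the original edit only shows the dataframe). Each (side, sender) pair should now produce one row per receive_date:

df['receive_time'] = pd.to_datetime(df['receive_time'])
df['receive_date'] = df['receive_time'].dt.date
mydf = df.groupby(['side', 'sender', 'receive_date']).apply(ordercount).reset_index()
print(mydf)  # expect one row per (side, sender, date), e.g. Buy/A for both 2019-01-01 and 2019-01-02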