I have a csv file containing several thousand records of company stock data. It contains the following integer fields:
low_price, high_price, volume_traded
10, 20, 45667
15, 22, 256565
41, 47, 45645
30, 39, 547343
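My requirement is to create a new csv file from this data by accumulating volume_traded at each price level (from low to high). The final result has the following two columns: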
price, total_volume_traded
10, 45667
11, 45667
12, 45667
....
....
15, 302232
etc
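In other words, the final csv contains one record for each price level (not only the high/low, but also the prices in between), along with the total volume_traded at that price level.
I have this working, but it is very slow and inefficient. I am sure there must be a better way to accomplish this.
Basically, what I do is use nested loops:
Here is some of the relevant code. I would be grateful if anyone could suggest a better approach in terms of efficiency/speed: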
df_existing = ...  # dataframe created from existing csv
df_new = ...  # dataframe for new Price/Volume values
for index, row in df_existing.iterrows():
    price = row['low_price']
    for i in range(row['low_price'], row['high_price']+1):
        volume = row['volume_traded']
        df_new = accumulate_volume(df_new, price, volume)
        price+=1
def accumulate_volume(df_new, price, volume):
    #If price level already exists, add volume to existing
    if df_new['Price'].loc[df_new['Price'] == price].count() > 0:
        df_new['Volume'].loc[df_new['Price'] == price] += volume
        return(df_new)
    else:
        #first occurrence of price level, add new row
        tmp = {'Price':int(price), 'Volume':volume}
        return(df_new.append(tmp, ignore_index=True))
#once the above finishes, df_new is written to the new csv file
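My guess as to why this is so slow is, at least in part, that 'append' creates a new object every time it is called, and it is called a LOT. In total, the nested loop in the code above runs 1595653 times.
I would be very grateful for any help.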
Answer 0 (score: 1)
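Let's set aside for a moment the potential problem with the methodology (consider what your results would look like if 100k shares traded at a price of 50-51 and another 100k shares traded at 50-59).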
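To make that concern concrete, here is a minimal sketch using the two hypothetical records from the remark above. It is only an illustration of the effect: because the full volume is credited to every price in a record's range, the volume summed over all price levels can be much larger than what was actually traded. The names `records`, `totals`, `actual_volume` and `credited_volume` are made up for this example.

```python
# Hypothetical records from the remark above: (low, high, volume).
records = [(50, 51, 100_000), (50, 59, 100_000)]

totals = {}
for low, high, volume in records:
    for price in range(low, high + 1):
        # The full volume is credited to every price level in the range.
        totals[price] = totals.get(price, 0) + volume

# Volume that actually changed hands versus volume summed over all levels.
actual_volume = sum(volume for _, _, volume in records)
credited_volume = sum(totals.values())

print(totals[50], totals[59])          # both records cover 50, only one covers 59
print(actual_volume, credited_volume)
```

Whether this matters depends on how the output is going to be read, so treat it as something to check rather than as a bug in the approach.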
Here is a set of commented steps that should achieve the goal:
import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})
# Initialize a price dictionary spanning range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}
# Create helper function to add volume to given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume
# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]
# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(d.keys(), name='price')
df = pd.DataFrame(list(d.values()), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')
>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
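If the nested loop over every price level is still too slow, one possible alternative (not part of the answer above) is a difference array: add each record's volume at its low price, subtract it just past its high price, and take a running sum. This is only a sketch and assumes integer prices, as in the question. The names `lo`, `hi`, `diff`, `totals` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

lo = int(df['low'].min())
hi = int(df['high'].max())

# One slot per price level, plus one extra slot for the "end" markers.
diff = np.zeros(hi - lo + 2, dtype=np.int64)

# Add each record's volume where its price range starts...
np.add.at(diff, df['low'].to_numpy() - lo, df['volume'].to_numpy())
# ...and subtract it one step past where the range ends.
np.add.at(diff, df['high'].to_numpy() - lo + 1, -df['volume'].to_numpy())

# The running sum is the total volume at each price level.
totals = np.cumsum(diff)[:-1]

result = pd.DataFrame({'total_volume_traded': totals},
                      index=pd.Index(np.arange(lo, hi + 1), name='price'))
print(result)
```

`np.add.at` is used instead of plain fancy-index assignment so that records sharing the same low or high price are all accumulated. The work is proportional to the number of records plus the number of price levels, rather than to the total number of (record, price) pairs.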
Answer 1 (score: 1)
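I would first group on the 'low_price' column, then sum volume_traded and reset the index. This effectively accumulates the volume for all prices of interest. You then want to sort by price, which makes the prices monotonic so that we can use them as the index. Once it is set as the index, we can call reindex with a newly computed index and use method='pad' to fill in the missing values: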
In [33]:
import io
import numpy as np
import pandas as pd

temp="""low_price,high_price,volume_traded
10,20,45667
15,22,256565
41,47,45645
10,20,12345
30,39,547343"""
df = pd.read_csv(io.StringIO(temp))
df
Out[33]:
low_price high_price volume_traded
0 10 20 45667
1 15 22 256565
2 41 47 45645
3 10 20 12345
4 30 39 547343
In [34]:
df1 = df.groupby('low_price')['volume_traded'].sum().reset_index()
df1
Out[34]:
low_price volume_traded
0 10 58012
1 15 256565
2 30 547343
3 41 45645
In [36]:
df1.sort_values('low_price').set_index('low_price').reindex(index = np.arange(df1['low_price'].min(), df1['low_price'].max()+1), method='pad')
Out[36]:
volume_traded
low_price
10 58012
11 58012
12 58012
13 58012
14 58012
15 256565
16 256565
17 256565
18 256565
19 256565
20 256565
21 256565
22 256565
23 256565
24 256565
25 256565
26 256565
27 256565
28 256565
29 256565
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 547343
41 45645
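Note that the output above forward-fills from each low_price and never looks at high_price, so prices 23 to 29, for example, carry 256565. If each record should only count between its own low and high price, as described in the question, one possible pandas-only sketch is to expand each record into its price levels with `explode` and then group by price. This assumes pandas 0.25 or later (for `DataFrame.explode`); the names `per_price` and `full_range` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'low_price': [10, 15, 41, 10, 30],
                   'high_price': [20, 22, 47, 20, 39],
                   'volume_traded': [45667, 256565, 45645, 12345, 547343]})

# Build the list of price levels covered by each record.
df['price'] = [list(range(low, high + 1))
               for low, high in zip(df['low_price'], df['high_price'])]

# One row per (record, price level), then sum the volume per price.
per_price = (df.explode('price')
               .astype({'price': 'int64'})
               .groupby('price')['volume_traded']
               .sum())

# Include price levels that no record touches, with a volume of 0.
full_range = range(per_price.index.min(), per_price.index.max() + 1)
per_price = per_price.reindex(full_range, fill_value=0)
per_price = per_price.rename('total_volume_traded').rename_axis('price')
print(per_price)
```

This still materializes one row per (record, price) pair, so it trades memory for the convenience of staying in pandas.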