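I have some code that compares actual data against target data, where the actuals live in one DataFrame and the targets in another. I need to look up each target, bring it into the DataFrame with the actuals, and then compare the two. In the simplified example below, I have a set of products and a set of locations, and each product/location combination has its own target.
I'm using nested for loops to do this, iterating over the products and the locations. The problem is that my real-life data is larger in every dimension, and looping through everything takes a very long time.
I've looked at various SO articles, but none of them seem (to me!) to relate to pandas and/or to my problem. Does anyone have a good idea on how to vectorize this code?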
import pandas as pd
import numpy as np
import time

employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
                 'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
                 'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
                 'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
                 'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']

tgt_data = {'Location' : location_list,
            'Product1' : [600, 200, 750, 225, 450, 175, 900],
            'Product2' : [300, 100, 350, 125, 200, 90, 450],
            'Product3' : [700, 250, 950, 275, 600, 225, 1200],
            'Product4' : [200, 100, 250, 75, 150, 75, 300],
            'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)

employee_data = {'Employee' : employee_list,
                 'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
                               'San Francisco', 'Phoenix', 'San Francisco',
                               'Eugene', 'San Francisco', 'Reno', 'Denver',
                               'Phoenix', 'Denver', 'Portland', 'Reno',
                               'Boulder', 'San Francisco', 'Phoenix',
                               'San Francisco', 'Phoenix'],
                 'Product1' : np.random.randint(1, 1000, 20),
                 'Product2' : np.random.randint(1, 700, 20),
                 'Product3' : np.random.randint(1, 1500, 20),
                 'Product4' : np.random.randint(1, 500, 20),
                 'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)

start = time.time()
for p in product_list:
    for l in location_list:
        emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
            tgt_df.loc[tgt_df['Location']==l, p].values)
    emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']
print(emp_df)
end = time.time()
print(end - start)
Answer 0 (score: 1)
If the target DataFrame is guaranteed to have unique locations, you can use a join to make this process really fast.
The join works on the same setup as in the question:
import pandas as pd
import numpy as np
import time

employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
                 'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
                 'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
                 'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
                 'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']

tgt_data = {'Location' : location_list,
            'Product1' : [600, 200, 750, 225, 450, 175, 900],
            'Product2' : [300, 100, 350, 125, 200, 90, 450],
            'Product3' : [700, 250, 950, 275, 600, 225, 1200],
            'Product4' : [200, 100, 250, 75, 150, 75, 300],
            'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)

employee_data = {'Employee' : employee_list,
                 'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
                               'San Francisco', 'Phoenix', 'San Francisco',
                               'Eugene', 'San Francisco', 'Reno', 'Denver',
                               'Phoenix', 'Denver', 'Portland', 'Reno',
                               'Boulder', 'San Francisco', 'Phoenix',
                               'San Francisco', 'Phoenix'],
                 'Product1' : np.random.randint(1, 1000, 20),
                 'Product2' : np.random.randint(1, 700, 20),
                 'Product3' : np.random.randint(1, 1500, 20),
                 'Product4' : np.random.randint(1, 500, 20),
                 'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
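The join step itself can be sketched roughly as follows. This is a minimal sketch, not necessarily the answerer's exact code: it rebuilds a smaller pair of frames with fixed, illustrative values so the block runs on its own, keeps the `_tgt` / `_pct` suffixes from the question, and uses `merged`, `tgt_cols`, and `pct_cols` as made-up names.

```python
import pandas as pd

product_list = ['Product1', 'Product2']

# Small stand-ins for tgt_df and emp_df (values are illustrative only).
tgt_df = pd.DataFrame({'Location': ['Denver', 'Boulder'],
                       'Product1': [600, 200],
                       'Product2': [300, 100]})
emp_df = pd.DataFrame({'Employee': ['Joe', 'Bernie', 'Kamala'],
                       'Location': ['Boulder', 'Denver', 'Denver'],
                       'Product1': [100, 300, 600],
                       'Product2': [50, 150, 30]})

# Left-join the targets onto each employee row by Location.
# validate='many_to_one' raises if a Location is duplicated in tgt_df,
# which enforces the "unique locations" assumption.
merged = emp_df.merge(tgt_df, on='Location', how='left',
                      suffixes=('', '_tgt'), validate='many_to_one')

# Divide all actual columns by all target columns in one array operation.
tgt_cols = [p + '_tgt' for p in product_list]
pct_cols = [p + '_pct' for p in product_list]
pct = merged[product_list].to_numpy() / merged[tgt_cols].to_numpy()
merged = merged.join(pd.DataFrame(pct, columns=pct_cols, index=merged.index))

print(merged)
```

A left join keeps every employee row in its original order, so an employee whose location has no target ends up with NaN in the `_tgt` and `_pct` columns rather than being dropped.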
Answer 1 (score: 0)
You are working with a "wide format" DataFrame. I find "long format" easier to manipulate.
# turn emp_df into long format,
# indexed by "Employee", "Location", and product
emp_df = (emp_df.set_index(['Employee', 'Location'])
                .stack().to_frame())
emp_df.head()
# output:
#                               0
# Employee Location
# Joe      Boulder  Product1  238
#                   Product2  135
#                   Product3  873
#                   Product4  153
#                   Product5  373

# turn tgt_df into a long series
# indexed by "Location" and product
tgt_df = tgt_df.set_index('Location').stack()
tgt_df.head()

# set target for employees by locations:
emp_df['target'] = (emp_df.groupby('Employee')[0]
                          .apply(lambda x: tgt_df))

# percentage
emp_df['pct'] = emp_df[0]/emp_df['target']

# you can get the wide format back by
# emp_df = emp_df.unstack(level=2)
# which will give you a dataframe with
# multi-level index and multi-level column
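The same long-format idea can also be sketched with `melt` and `merge` instead of `stack` and `groupby`. This is an alternative under stated assumptions, not the answerer's code: the frames are small stand-ins with illustrative values, and the names `emp_long`, `tgt_long`, `long_df`, `wide_pct`, `actual`, and `target` are made up for the example.

```python
import pandas as pd

# Small stand-ins for tgt_df and emp_df (values are illustrative only).
tgt_df = pd.DataFrame({'Location': ['Denver', 'Boulder'],
                       'Product1': [600, 200],
                       'Product2': [300, 100]})
emp_df = pd.DataFrame({'Employee': ['Joe', 'Bernie', 'Kamala'],
                       'Location': ['Boulder', 'Denver', 'Denver'],
                       'Product1': [100, 300, 600],
                       'Product2': [50, 150, 30]})

# Wide -> long: one row per (Employee, Location, Product).
emp_long = emp_df.melt(id_vars=['Employee', 'Location'],
                       var_name='Product', value_name='actual')
# Wide -> long: one row per (Location, Product).
tgt_long = tgt_df.melt(id_vars='Location',
                       var_name='Product', value_name='target')

# Look up each target by its (Location, Product) key, then compare.
long_df = emp_long.merge(tgt_long, on=['Location', 'Product'], how='left')
long_df['pct'] = long_df['actual'] / long_df['target']

# Optional: back to one pct column per product.
wide_pct = long_df.pivot(index='Employee', columns='Product', values='pct')

print(long_df)
print(wide_pct)
```

Merging on the two key columns only pairs each employee row with its own location's target, so nothing larger than the employee data is built along the way.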