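I need to append modified copies of a dataframe's own rows to it, based on a separate table, but this has to run over millions of rows and I'm having trouble optimizing it.
I have a dataframe df in which one of the identifiers is a column 'myIndex':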
df=pd.DataFrame([
{'myIndex':1,'dummyData':2,'type':'test'},
{'myIndex':1,'dummyData':3,'type':'test'},
{'myIndex':1,'dummyData':4,'type':'other'},
{'myIndex':2,'dummyData':22,'type':'test'},
{'myIndex':2,'dummyData':32,'type':'test'},
{'myIndex':2,'dummyData':42,'type':'other'},
{'myIndex':3,'dummyData':23,'type':'test'},
{'myIndex':3,'dummyData':33,'type':'test'},
{'myIndex':3,'dummyData':43,'type':'other'}
])
I have to repeat its rows based on a table in another dataframe:
repeatedRows=pd.DataFrame([
{'myIndex':1,'idx2':21},
{'myIndex':1,'idx2':31},
{'myIndex':1,'idx2':42},
{'myIndex':2,'idx2':221},
{'myIndex':2,'idx2':231},
{'myIndex':2,'idx2':242},
{'myIndex':3,'idx2':321},
{'myIndex':3,'idx2':331},
{'myIndex':4,'idx2':342}
])
So the rows of df whose myIndex appears in repeatedRows should be copied, with 'myIndex' replaced by idx2 and 'type' replaced by 'duplicate'. This is my latest attempt:
import pandas as pd
pd.options.mode.chained_assignment = None  # Avoid annoying warning

def Test1(df, repeatedRows):
    for idx in set(repeatedRows['myIndex']):
        vals = df[df['myIndex'] == idx]
        for n in repeatedRows[repeatedRows['myIndex'] == idx].loc[:, 'idx2']:
            vals.loc[:, 'myIndex'] = n
            vals.loc[:, 'type'] = 'duplicate'
            df = pd.concat([df, vals], ignore_index=True)
    return df
This is the expected result:
untitled0.Test1()
Out[44]:
dummyData myIndex type
0 2 1 test
1 3 1 test
2 4 1 other
3 22 2 test
4 32 2 test
5 42 2 other
6 23 3 test
7 33 3 test
8 43 3 other
9 2 21 duplicate
10 3 21 duplicate
11 4 21 duplicate
12 2 31 duplicate
13 3 31 duplicate
14 4 31 duplicate
15 2 42 duplicate
16 3 42 duplicate
17 4 42 duplicate
18 22 221 duplicate
19 32 221 duplicate
20 42 221 duplicate
21 22 231 duplicate
22 32 231 duplicate
23 42 231 duplicate
24 22 242 duplicate
25 32 242 duplicate
26 42 242 duplicate
27 23 321 duplicate
28 33 321 duplicate
29 43 321 duplicate
30 23 331 duplicate
31 33 331 duplicate
32 43 331 duplicate
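I've implemented it in several different ways, but it still takes a very long time. I suspect I'm doing something very inefficient or not taking full advantage of pandas. Any suggestions would be appreciated.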
Answer 0 (score: 2)
You need to merge the two dataframes, make the modifications, and concatenate the result back onto df:
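# An inner merge keeps only the df rows whose myIndex appears in repeatedRows;
# how='left' would also keep unmatched rows, with NaN in idx2.
mer = df.merge(repeatedRows, on='myIndex', how='inner')
mer['myIndex'] = mer['idx2']   # replace the original key with idx2
mer['type'] = 'duplicate'      # mark the copied rows
df = pd.concat([df, mer.drop(['idx2'], axis=1)], ignore_index=True)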
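For reference, here is a minimal self-contained sketch of the merge approach run on the sample data from the question. The helper name `add_duplicates` and the variable `result` are made up for illustration, and the sketch assumes an inner merge is what is wanted, so that entries in repeatedRows with no matching row in df (myIndex 4 here) produce nothing.

```python
import pandas as pd

df = pd.DataFrame([
    {'myIndex': 1, 'dummyData': 2, 'type': 'test'},
    {'myIndex': 1, 'dummyData': 3, 'type': 'test'},
    {'myIndex': 1, 'dummyData': 4, 'type': 'other'},
    {'myIndex': 2, 'dummyData': 22, 'type': 'test'},
    {'myIndex': 2, 'dummyData': 32, 'type': 'test'},
    {'myIndex': 2, 'dummyData': 42, 'type': 'other'},
    {'myIndex': 3, 'dummyData': 23, 'type': 'test'},
    {'myIndex': 3, 'dummyData': 33, 'type': 'test'},
    {'myIndex': 3, 'dummyData': 43, 'type': 'other'},
])

repeatedRows = pd.DataFrame([
    {'myIndex': 1, 'idx2': 21},
    {'myIndex': 1, 'idx2': 31},
    {'myIndex': 1, 'idx2': 42},
    {'myIndex': 2, 'idx2': 221},
    {'myIndex': 2, 'idx2': 231},
    {'myIndex': 2, 'idx2': 242},
    {'myIndex': 3, 'idx2': 321},
    {'myIndex': 3, 'idx2': 331},
    {'myIndex': 4, 'idx2': 342},
])


def add_duplicates(df, repeatedRows):
    """Hypothetical helper: append one modified copy of each df row
    for every matching (myIndex, idx2) pair in repeatedRows."""
    # Each df row is paired with every repeatedRows row sharing its myIndex.
    mer = df.merge(repeatedRows, on='myIndex', how='inner')
    mer['myIndex'] = mer['idx2']   # the copy takes idx2 as its new key
    mer['type'] = 'duplicate'      # mark the copy
    # Drop the helper column and append the copies to the original rows.
    return pd.concat([df, mer.drop(columns='idx2')], ignore_index=True)


result = add_duplicates(df, repeatedRows)
print(len(result))  # 33: the 9 original rows plus 24 duplicates
```

The copies come out in the order the merge produces them, which may differ from the row order shown in the question; if the order matters, sorting the merged frame before concatenating is one option.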