Efficiently duplicating dataframe rows based on a separate table

Time: 2018-05-23 00:18:51

Tags: python pandas merge left-join

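I need to append modified copies of a dataframe's own rows to it, based on a separate table. This has to run over millions of rows, and I am having trouble optimizing it.

I have a dataframe df in which one column, 'myIndex', serves as an identifier: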

    df=pd.DataFrame([
                 {'myIndex':1,'dummyData':2,'type':'test'},
                 {'myIndex':1,'dummyData':3,'type':'test'},
                 {'myIndex':1,'dummyData':4,'type':'other'},
                 {'myIndex':2,'dummyData':22,'type':'test'},
                 {'myIndex':2,'dummyData':32,'type':'test'},
                 {'myIndex':2,'dummyData':42,'type':'other'},
                 {'myIndex':3,'dummyData':23,'type':'test'},
                 {'myIndex':3,'dummyData':33,'type':'test'},
                 {'myIndex':3,'dummyData':43,'type':'other'}
                 ])

I have to repeat its rows according to a table in another dataframe:

    repeatedRows=pd.DataFrame([
                           {'myIndex':1,'idx2':21},
                           {'myIndex':1,'idx2':31},
                           {'myIndex':1,'idx2':42},
                           {'myIndex':2,'idx2':221},
                           {'myIndex':2,'idx2':231},
                           {'myIndex':2,'idx2':242},
                           {'myIndex':3,'idx2':321},
                           {'myIndex':3,'idx2':331},
                           {'myIndex':4,'idx2':342}
                           ])

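So every row of df whose myIndex appears in repeatedRows should be copied once per matching idx2, with 'myIndex' replaced by idx2 and 'type' replaced by 'duplicate'. This is my latest attempt: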

    import pandas as pd
    pd.options.mode.chained_assignment = None  # Avoid annoying warning

    def Test1(df, repeatedRows):
        # Loop over every distinct myIndex in the lookup table
        for idx in set(repeatedRows['myIndex']):
            vals = df[df['myIndex'] == idx]
            # Append one modified copy of the matching rows per idx2 value
            for n in repeatedRows[repeatedRows['myIndex'] == idx].loc[:, 'idx2']:
                vals.loc[:, 'myIndex'] = n
                vals.loc[:, 'type'] = 'duplicate'
                df = pd.concat([df, vals], ignore_index=True)
        return df

This is the expected result:

    Test1(df, repeatedRows)

        dummyData  myIndex       type
    0           2        1       test
    1           3        1       test
    2           4        1      other
    3          22        2       test
    4          32        2       test
    5          42        2      other
    6          23        3       test
    7          33        3       test
    8          43        3      other
    9           2       21  duplicate
    10          3       21  duplicate
    11          4       21  duplicate
    12          2       31  duplicate
    13          3       31  duplicate
    14          4       31  duplicate
    15          2       42  duplicate
    16          3       42  duplicate
    17          4       42  duplicate
    18         22      221  duplicate
    19         32      221  duplicate
    20         42      221  duplicate
    21         22      231  duplicate
    22         32      231  duplicate
    23         42      231  duplicate
    24         22      242  duplicate
    25         32      242  duplicate
    26         42      242  duplicate
    27         23      321  duplicate
    28         33      321  duplicate
    29         43      321  duplicate
    30         23      331  duplicate
    31         33      331  duplicate
    32         43      331  duplicate

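I have implemented this in several different ways, but it still takes far too long. I suspect I am doing something very inefficient or not taking full advantage of pandas. Any suggestions would be appreciated.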

1 Answer:

Answer 0: (score: 2)

You need to merge the two dataframes, make the modifications, and concatenate the result back onto df:

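    # Pair every df row with each idx2 that shares its myIndex
    mer = df.merge(repeatedRows, on='myIndex', how='left')
    # Turn the merged rows into the duplicates
    mer['myIndex'] = mer['idx2']
    mer['type'] = 'duplicate'
    # Append the duplicates to the original rows
    df = pd.concat([df, mer.drop(['idx2'], axis=1)], ignore_index=True)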
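
As a rough sketch, the same steps can be wrapped in a function. The helper name add_duplicates and the small sample frames below are made up for illustration, and the sketch assumes an inner join is what you want: with how='left', a row of df whose myIndex has no match in repeatedRows would still be copied, with NaN in myIndex. In the data from the question every myIndex in df has a match, so both joins give the same rows there.

```python
import pandas as pd


def add_duplicates(df, repeated_rows):
    """Append one modified copy of each df row per matching idx2.

    Hypothetical helper, not part of pandas.
    """
    # Inner join: df rows without a match in repeated_rows produce no copy,
    # whereas a left join would keep them with a NaN idx2.
    mer = df.merge(repeated_rows, on='myIndex', how='inner')
    mer['myIndex'] = mer['idx2']
    mer['type'] = 'duplicate'
    mer = mer.drop(columns='idx2')
    # A single concat instead of one concat per loop iteration
    return pd.concat([df, mer], ignore_index=True)


# Illustrative data: myIndex 5 has no entry in the lookup table,
# and lookup key 4 does not occur in df.
df = pd.DataFrame({
    'myIndex': [1, 1, 2, 5],
    'dummyData': [2, 3, 22, 99],
    'type': ['test', 'other', 'test', 'test'],
})
repeated_rows = pd.DataFrame({
    'myIndex': [1, 1, 2, 4],
    'idx2': [21, 31, 221, 342],
})

result = add_duplicates(df, repeated_rows)
print(result)
```

The copies may come out in a different row order than in the loop version, so sort afterwards if the order matters.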