Matching CSVs on multiple columns (the columns do not always match!)

Time: 2019-01-07 17:33:19

Tags: python pandas

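I have 2 csv files with three identifier columns, e.g. 'a', 'b' and 'c'.
The CSVs have to match on at least 1 column (so not necessarily all 3). I know for sure that at least one of these identifiers matches (I just don't know which one). There are also cases where 'a' != 'a' but 'b' == 'b'; in those cases I still want the rows to be matched.

My idea was to add the values to one csv file by looping over these identifier columns: for example, if 'a' != 'a', check whether 'b' == 'b', and then add the columns from the second csv.

This is probably not the most efficient way, but it is the only one I could think of (I am still very new to Python).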

import numpy as np

df1['var1'] = 0

for index, row in df1.iterrows():
    print(index)
    for index2, row2 in df2.iterrows():
        if df1['a'][index] == df2['a'][index2]:
            df1['var1'] = df2['var1']  # add rest of variables
        elif df1['b'][index] == df2['b'][index2]:
            df1['var1'] = df2['var1']  # add rest of variables
        elif df1['c'][index] == df2['c'][index2]:
            df1['var1'] = df2['var1']  # add rest of variables
        else:
            df1['var1'] = np.nan

Sample data

df1

a;b;c
GWIMPBWGXFLOXCTMWTQZ;JWRLDDZNSEDQIJWZXUKC;CKKAYMVNTLQHRJMKTGYM
IOUXKHIERLLTIWFZNBOY;LTVJGHXDSQBIISYRUGSB;FWIIEJPSGJIDMBMMHVCC
VDGPMKXPKMQYCFPSPRVV;JODUSSSZMVGJMPNUZZTU;SXMSOPVFRLYBJVYJEIRW
XJLLWPCRPLYAOKWGCNSA;QOABSMYWLCMRZUQETBSW;LTYFHWMKPDPUXJDFXEGE
HKSGVXNGZYCDKIVMHPOQ;TXWBAJZNXCHRNDTOVGSK;SFUIWDVEVBQASJPXGYET
LJHOTWYPZCXJQMJDFBMX;KFTJXNDFDZHLKNHGIBPO;PNEBLKIVUVUKBOLRNJWR
JKYIABDSHIMCFBFKYMHI;FAUMYUUFVVKGIFODYMBM;YKCNNIRFLWDFKXAJBIYB
FGNQDGBIHUQOXHUZFZVG;EKYZSQQDWNABDOMUYBCB;ZCJTLHITYEUIQOAXEMIC
SVZAYRKZKTLCSWLYUTXH;JZMPNGLCCWVZOEQBDCEJ;JGDYBLYRBACDNTHEKJKI
UEEUYZHPMJRPFYPRWLGX;MTJOFRQYEXAQDZFHXMJE;SLEAHIGGOYJKRMDLIYQB
ZILSTFUZVBNQVCQBRLCQ;VLJPEKQTHVYJSSPDCTXO;VEXYZXHKQANMYCSWJCKJ
WFIEQVJAAPBJRLBOFVLM;OHUNXXTJGIVAOQNWUKZV;IYVKLYRFQWKDXEOLYBCU
VGPJZITWIOHVOJGBVKPD;XUOWFMLJZPGXMDICKTRM;DZIAVAPJYOAETIZOGIOV
BBWCSDGLFWPJNGYHJFJY;XWAFMPCGCJLZDDQDKYWJ;ODMXYHHRCIOCTKWUETIG
OXDFCYSCNNOLILXYUBKD;HOKQECAJJTPWWCILRXSR;XWZZKFJXSKUEJRMJNAWW
ZEJZXTIQMKLUGHLHHLXD;GKDGXNGWNPEQBFFISGPM;ZPMKALEPWATAWNEOYXAR
QICFKQZOYPYGQJDUIMSC;YQWKXJXEWMXISJVPRVVV;IIDRIDKDPXTOIMVTBERK
CXJPRVANPQYDERCZIUDB;DQOLCHRUTYZEOJSFQRFN;XVMJLZBHSTOXPIQOOJTM
FTSITDDXKVIEOAOFFDXV;AWPPKQQNVUAHMJICUXVA;BWIXIYBZUGJYBHHAQZWO
QHDUVSQFETFVZJOKNNZV;VJSMCXMOWFKRKXMGAYRI;XJALGABNCZWVKHMXWWCW

df2

a;b;c;var1
GWIMEEBWGXFLOXCTMWTQZ;;CKKAYMVNTLQHRJMKTGYM;834562
IOUXKHIERLLTIWFZNBOY;LTVJGHXDSQBIISYRUGSB;FWEERRPSGJIDMBMMHVCC;2345658
;JODUSSSZMVGJMPNUZZTU;SXMSOPVFRLYBJVYJEIRW;662453
XJLLWPCRPLYAOKWGCNSA;QOABSMYWLCMRZUQETBSW;;324276
HKSGVXNGZYCDKIVMHPOQ;TXWBAJZNXCHRNDTOVGSK;SFUIWDVEVBQASJPXGYET;1134921
LJHOTWYPZCXJQMJDFBMX;KFTJXNDFDZHLKNHGIBPO;PNEBLKIVUVUKBOLRNJWR;2019234
JKYIABDSHIMCFBFKYMHI;FAUMYUUFVVKGIFODYMBM;YKCNNIRFLWDFKXAJBIYB;9872346
FGNQDGBIHUQOXHUZFZVG;EKYZSQQDWNABDOMUYBCB;ZCJTLHITYEUIQOAXEMIC;7564374
SVZAYRKZKTLCSWLYUTXH;;;2345252
UEEUYZHPMJRPFYPRWLGX;MTJOFRQYEXAQDZFHXMJE;SLEAHIGGOYJKRMDLIYQB;5654632
ZILSTDSDSBNQVCQBRLCQ;;VEXYZXHKQANMYCSWJCKJ;4524234
WFIEQVJAAPBJRLBOFVLM;OHUNXXTJGIVAOQNWUKZV;IYVKLYRFQWKDXEOLYBCU;2423423
VGPJZITWIOHVOJGBVKPD;XUOWFMLJZPGXMDICKTRM;DZIAVAPJYOAETIZOGIOV;3423425
;XWAFREWGCJLZDDQDKYWJ;ODMXYHHRCIOCTKWUETIG;7864375
OXDFCYSCNNOLILXYUBKD;HOKQECAJJTPWWCILRXSR;XWZZKFJXSKUEJRMJNAWW;2132543
ZEJZXTIQMKLUGHLHHLXD;GKDGXNGWNPEQBFFISGPM;ZREWALEPWATAWNEOYXAR;4524235
QICFKQZOYPYGQJDUIMSC;;IIDRIDKDPXTOIMVTBERK;5544332
CXJPRVANPQYDERCZIUDB;DQOLCHRUTYZEOJSFQRFN;XVMJLZBHSTOXPIQOOJTM;9345633
FTSITDDXKVIEWAOFFDXV;AWPPKQQNVUAHMJICUXVA;BWIXIYBZUGJYBHHAQZWO;4213465
QHDUVSQFETFVZJOKNNZV;VJSMCXMOWFKRKXMGAYRI;XJALGABNCZWVKHMXWWCW;2143112

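I want var1 to contain the value from the second csv (when 'a', 'b' or 'c' matches), or NaN when none of the identifiers match (this should not happen, but it might in this sample data).

However, something must be wrong, because the code is very slow: looping through 1 row takes about 20 seconds. The dataset is not huge (599 cases).

There must be a simpler and faster way to do this. Could you point me in the right direction? Thanks!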

1 answer:

Answer 0: (score: 1)

If, as in your example, everything is already sorted in exactly the right order, you could simply do:

df1['var1'] = df2['var1']

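Since that is probably not what you mean, here is a solution for when the dataframes cannot be matched that easily:

First merge df1 on column 'a' with columns 'a' and 'var1' of df2. This is basically a left join: 'var1' is added wherever the join is possible, and NaN is added otherwise.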

df1_a = df1.merge(df2.loc[df2['a'].notnull(), ['a','var1']], how='left')
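To see what this single step does in isolation, here is a minimal sketch on toy frames. The names `left`, `right` and `lookup` and all the values are made up for illustration and are not the question's data. The sketch passes `on="a"` explicitly, whereas the line above relies on merge joining on the shared column name by default. The `notnull()` filter is there because pandas matches missing keys against each other when both sides contain them.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values, not the question's data)
left = pd.DataFrame({"a": ["x1", "x2", "x3"], "b": ["y1", "y2", "y3"]})
right = pd.DataFrame({"a": ["x1", np.nan, "zz"], "var1": [10, 20, 30]})

# Keep only the rows of `right` that actually have a key in 'a',
# and only the two columns needed for the join
lookup = right.loc[right["a"].notnull(), ["a", "var1"]]

# Left join: every row of `left` is kept; var1 is NaN where 'a' has no partner
merged = left.merge(lookup, on="a", how="left")

print(merged)
```

Only the first row finds a partner here, so the other two rows end up with NaN in var1, which is what the later fill step relies on.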

Then do the same again for the other columns, so one merge on column 'b' and another on column 'c':

df1_b = df1.merge(df2.loc[df2['b'].notnull(), ['b','var1']], how='left')

df1_c = df1.merge(df2.loc[df2['c'].notnull(), ['c','var1']], how='left')

Finally, combine all the separate dataframes, filling in a value only where the var1 column still contains NaN:

df1 = df1_a.fillna(df1_b).fillna(df1_c)
df1['var1'] = df1['var1'].astype(int)

Result:

    a   b   c   var1
0   GWIMPBWGXFLOXCTMWTQZ    JWRLDDZNSEDQIJWZXUKC    CKKAYMVNTLQHRJMKTGYM    834562
1   IOUXKHIERLLTIWFZNBOY    LTVJGHXDSQBIISYRUGSB    FWIIEJPSGJIDMBMMHVCC    2345658
2   VDGPMKXPKMQYCFPSPRVV    JODUSSSZMVGJMPNUZZTU    SXMSOPVFRLYBJVYJEIRW    662453
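
The same idea can also be wrapped in one helper. The sketch below is a variation, not the code above: it uses `Series.map` on a lookup table instead of `merge`, which keeps the rows of df1 aligned even if a key occurs more than once in df2. The function name `match_on_any`, the toy frames and their values are hypothetical. It assumes var1 holds integers, that the first occurrence should win when a key is repeated in df2, and that keys are tried in the order given (so 'a' takes precedence over 'b', then 'c').

```python
import numpy as np
import pandas as pd

def match_on_any(df1, df2, keys, value_col):
    """Fill value_col in df1 from df2, trying each key column in order."""
    result = pd.Series(np.nan, index=df1.index, dtype="float64")
    for key in keys:
        # Lookup table: key value -> value_col, skipping missing keys and
        # keeping only the first occurrence of a repeated key (assumption)
        lookup = (
            df2.dropna(subset=[key])
               .drop_duplicates(subset=[key])
               .set_index(key)[value_col]
        )
        # Only rows that are still unmatched get filled by this key
        result = result.fillna(df1[key].map(lookup))
    out = df1.copy()
    # Nullable integer dtype: unmatched rows become <NA> instead of raising
    out[value_col] = result.astype("Int64")
    return out

# Toy data (hypothetical): row 0 matches on 'a', row 1 on 'b',
# row 2 on 'c', row 3 on nothing
df1 = pd.DataFrame({"a": ["a1", "a2", "a3", "a4"],
                    "b": ["b1", "b2", "b3", "b4"],
                    "c": ["c1", "c2", "c3", "c4"]})
df2 = pd.DataFrame({"a": ["a1", None, "zz", "qq"],
                    "b": [None, "b2", "yy", "qq"],
                    "c": ["c1", "c2", "c3", "qq"],
                    "var1": [11, 22, 33, 44]})

matched = match_on_any(df1, df2, keys=["a", "b", "c"], value_col="var1")
print(matched)
```

The nullable `Int64` dtype is used here because a plain `astype(int)` raises as soon as one row has no match in any of the columns.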