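I have 2 csv files that share several identifier columns, e.g. 'a', 'b' and 'c'.
The csv files have to match on at least 1 of these columns (so not necessarily on all 3).
I know for certain that at least one of these identifiers matches (I just don't know which one). There are also cases where 'a' != 'a' but 'b' == 'b', and in those cases I still want the rows to be matched.
My idea was to add the values to one csv file by looping through these identifier columns: for example, if 'a' != 'a', check whether 'b' == 'b', and then add the columns from the second csv.
This is probably not the most efficient approach, but it is the only one I could think of (I am still quite new to Python).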
df1['var1'] = 0
for index, row in df1.iterrows():
    print(index)
    for index2, row2 in df2.iterrows():
        if df1['a'][index] == df2['a'][index2]:
            df1['var1'] = df2['var1']  # add rest of variables
        elif df1['b'][index] == df2['b'][index2]:
            df1['var1'] = df2['var1']  # add rest of variables
        elif df1['c'][index] == df2['c'][index2]:
            df1['var1'] = df2['var1']  # add rest of variables
        else:
            df1['var1'] = np.nan
Sample data
df1
a;b;c
GWIMPBWGXFLOXCTMWTQZ;JWRLDDZNSEDQIJWZXUKC;CKKAYMVNTLQHRJMKTGYM
IOUXKHIERLLTIWFZNBOY;LTVJGHXDSQBIISYRUGSB;FWIIEJPSGJIDMBMMHVCC
VDGPMKXPKMQYCFPSPRVV;JODUSSSZMVGJMPNUZZTU;SXMSOPVFRLYBJVYJEIRW
XJLLWPCRPLYAOKWGCNSA;QOABSMYWLCMRZUQETBSW;LTYFHWMKPDPUXJDFXEGE
HKSGVXNGZYCDKIVMHPOQ;TXWBAJZNXCHRNDTOVGSK;SFUIWDVEVBQASJPXGYET
LJHOTWYPZCXJQMJDFBMX;KFTJXNDFDZHLKNHGIBPO;PNEBLKIVUVUKBOLRNJWR
JKYIABDSHIMCFBFKYMHI;FAUMYUUFVVKGIFODYMBM;YKCNNIRFLWDFKXAJBIYB
FGNQDGBIHUQOXHUZFZVG;EKYZSQQDWNABDOMUYBCB;ZCJTLHITYEUIQOAXEMIC
SVZAYRKZKTLCSWLYUTXH;JZMPNGLCCWVZOEQBDCEJ;JGDYBLYRBACDNTHEKJKI
UEEUYZHPMJRPFYPRWLGX;MTJOFRQYEXAQDZFHXMJE;SLEAHIGGOYJKRMDLIYQB
ZILSTFUZVBNQVCQBRLCQ;VLJPEKQTHVYJSSPDCTXO;VEXYZXHKQANMYCSWJCKJ
WFIEQVJAAPBJRLBOFVLM;OHUNXXTJGIVAOQNWUKZV;IYVKLYRFQWKDXEOLYBCU
VGPJZITWIOHVOJGBVKPD;XUOWFMLJZPGXMDICKTRM;DZIAVAPJYOAETIZOGIOV
BBWCSDGLFWPJNGYHJFJY;XWAFMPCGCJLZDDQDKYWJ;ODMXYHHRCIOCTKWUETIG
OXDFCYSCNNOLILXYUBKD;HOKQECAJJTPWWCILRXSR;XWZZKFJXSKUEJRMJNAWW
ZEJZXTIQMKLUGHLHHLXD;GKDGXNGWNPEQBFFISGPM;ZPMKALEPWATAWNEOYXAR
QICFKQZOYPYGQJDUIMSC;YQWKXJXEWMXISJVPRVVV;IIDRIDKDPXTOIMVTBERK
CXJPRVANPQYDERCZIUDB;DQOLCHRUTYZEOJSFQRFN;XVMJLZBHSTOXPIQOOJTM
FTSITDDXKVIEOAOFFDXV;AWPPKQQNVUAHMJICUXVA;BWIXIYBZUGJYBHHAQZWO
QHDUVSQFETFVZJOKNNZV;VJSMCXMOWFKRKXMGAYRI;XJALGABNCZWVKHMXWWCW
df2
a;b;c;var1
GWIMEEBWGXFLOXCTMWTQZ;;CKKAYMVNTLQHRJMKTGYM;834562
IOUXKHIERLLTIWFZNBOY;LTVJGHXDSQBIISYRUGSB;FWEERRPSGJIDMBMMHVCC;2345658
;JODUSSSZMVGJMPNUZZTU;SXMSOPVFRLYBJVYJEIRW;662453
XJLLWPCRPLYAOKWGCNSA;QOABSMYWLCMRZUQETBSW;;324276
HKSGVXNGZYCDKIVMHPOQ;TXWBAJZNXCHRNDTOVGSK;SFUIWDVEVBQASJPXGYET;1134921
LJHOTWYPZCXJQMJDFBMX;KFTJXNDFDZHLKNHGIBPO;PNEBLKIVUVUKBOLRNJWR;2019234
JKYIABDSHIMCFBFKYMHI;FAUMYUUFVVKGIFODYMBM;YKCNNIRFLWDFKXAJBIYB;9872346
FGNQDGBIHUQOXHUZFZVG;EKYZSQQDWNABDOMUYBCB;ZCJTLHITYEUIQOAXEMIC;7564374
SVZAYRKZKTLCSWLYUTXH;;;2345252
UEEUYZHPMJRPFYPRWLGX;MTJOFRQYEXAQDZFHXMJE;SLEAHIGGOYJKRMDLIYQB;5654632
ZILSTDSDSBNQVCQBRLCQ;;VEXYZXHKQANMYCSWJCKJ;4524234
WFIEQVJAAPBJRLBOFVLM;OHUNXXTJGIVAOQNWUKZV;IYVKLYRFQWKDXEOLYBCU;2423423
VGPJZITWIOHVOJGBVKPD;XUOWFMLJZPGXMDICKTRM;DZIAVAPJYOAETIZOGIOV;3423425
;XWAFREWGCJLZDDQDKYWJ;ODMXYHHRCIOCTKWUETIG;7864375
OXDFCYSCNNOLILXYUBKD;HOKQECAJJTPWWCILRXSR;XWZZKFJXSKUEJRMJNAWW;2132543
ZEJZXTIQMKLUGHLHHLXD;GKDGXNGWNPEQBFFISGPM;ZREWALEPWATAWNEOYXAR;4524235
QICFKQZOYPYGQJDUIMSC;;IIDRIDKDPXTOIMVTBERK;5544332
CXJPRVANPQYDERCZIUDB;DQOLCHRUTYZEOJSFQRFN;XVMJLZBHSTOXPIQOOJTM;9345633
FTSITDDXKVIEWAOFFDXV;AWPPKQQNVUAHMJICUXVA;BWIXIYBZUGJYBHHAQZWO;4213465
QHDUVSQFETFVZJOKNNZV;VJSMCXMOWFKRKXMGAYRI;XJALGABNCZWVKHMXWWCW;2143112
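I want var1 to contain the value from the second csv when there is a match on 'a', 'b' or 'c', or NaN when none of the identifiers match (this should not happen in the real data, but it might in this sample data).
However, something must be wrong, because the code is very slow: looping through 1 row takes about 20 seconds. The dataset is not huge (599 cases).
There must be an easier and faster way to do this. Could you point me in the right direction? Thanks!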
Answer 0: (score: 1)
In your example everything is sorted in exactly the right order, in which case you could simply do:
df1['var1'] = df2['var1']
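As a side note, this direct assignment pairs rows by index label, not by identifier value, which is why it only works when both frames are in the same order. A minimal sketch of that behaviour, using small made-up frames (the identifiers and values here are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical toy frames; df2 holds the same identifiers in reverse order.
df1 = pd.DataFrame({"a": ["x", "y", "z"]})
df2 = pd.DataFrame({"a": ["z", "y", "x"], "var1": [1, 2, 3]})

# Column assignment aligns on the index (0, 1, 2), not on the values in 'a'.
df1["var1"] = df2["var1"]

# Row 0 of df1 has a == 'x' but receives var1 from row 0 of df2 (a == 'z').
print(df1)
```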
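Since that is probably not what you meant, here is a solution for dataframes that cannot be matched up so easily:
First merge df1 on column 'a' with columns 'a' and 'var1' of df2. This is basically a left join: 'var1' is filled in wherever a join is possible, and NaN otherwise.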
df1_a = df1.merge(df2.loc[df2['a'].notnull(), ['a','var1']], how='left')
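To make the left-join behaviour concrete, here is a small sketch on made-up data (the frames `left` and `right` and their values are hypothetical). It also passes `on='a'` explicitly, which is optional here because 'a' is the only shared column. Rows with a missing key are filtered out first because, as far as I know, pandas matches missing keys against each other in a merge.

```python
import pandas as pd

# Hypothetical toy frames standing in for df1 and df2.
left = pd.DataFrame({"a": ["k1", "k2", "k3"], "b": ["p1", "p2", "p3"]})
right = pd.DataFrame({"a": ["k1", None, "k3"], "var1": [10, 20, 30]})

# Keep only the rows of `right` that actually have an 'a' identifier.
right_a = right.loc[right["a"].notnull(), ["a", "var1"]]

# Left join: every row of `left` is kept; var1 is NaN where 'a' has no partner.
merged = left.merge(right_a, on="a", how="left")
print(merged)
```

Because the unmatched row introduces a NaN, the var1 column comes back as float rather than int.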
Then do the same again for the other columns, so one merge on column 'b' and another on column 'c':
df1_b = df1.merge(df2.loc[df2['b'].notnull(), ['b','var1']], how='left')
df1_c = df1.merge(df2.loc[df2['c'].notnull(), ['c','var1']], how='left')
Finally combine the separate dataframes, taking a value from the next one only where the var1 column is still NaN:
df1 = df1_a.fillna(df1_b).fillna(df1_c)
df1['var1'] = df1['var1'].astype(int)  # raises if any var1 is still NaN, i.e. a row found no match
Result:
                      a                     b                     c     var1
0  GWIMPBWGXFLOXCTMWTQZ  JWRLDDZNSEDQIJWZXUKC  CKKAYMVNTLQHRJMKTGYM   834562
1  IOUXKHIERLLTIWFZNBOY  LTVJGHXDSQBIISYRUGSB  FWIIEJPSGJIDMBMMHVCC  2345658
2  VDGPMKXPKMQYCFPSPRVV  JODUSSSZMVGJMPNUZZTU  SXMSOPVFRLYBJVYJEIRW   662453
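If there are more identifier columns, or some rows may end up without any match, the same idea can be written as a loop. The sketch below is one possible variant, not the only way to do it: it swaps the three merges for `Series.map` lookups, and it uses the nullable `Int64` dtype so that unmatched rows stay `<NA>` instead of breaking the integer cast. The helper name `lookup_by_first_match` and the toy data are made up for illustration, and duplicate identifiers in df2 are resolved by keeping the first occurrence, which is an assumption about what you want.

```python
import pandas as pd

# Hypothetical toy data: each df1 row matches df2 on a different column,
# and the last row matches nothing.
df1 = pd.DataFrame({
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4"],
})
df2 = pd.DataFrame({
    "a": ["a1", "zz", None, "q1"],
    "b": ["b1", "b2", None, "q2"],
    "c": ["c1", "c2", "c3", "q3"],
    "var1": [11, 22, 33, 44],
})


def lookup_by_first_match(left, right, keys, value):
    """Return right[value] for each row of left, trying the keys in order."""
    result = pd.Series(float("nan"), index=left.index, name=value)
    for key in keys:
        # Build an identifier -> value lookup, skipping missing identifiers
        # and keeping the first occurrence of any duplicated identifier.
        known = right.dropna(subset=[key]).drop_duplicates(subset=[key])
        mapping = known.set_index(key)[value]
        # Only rows that are still NaN are filled by this key.
        result = result.fillna(left[key].map(mapping))
    return result


matched = lookup_by_first_match(df1, df2, ["a", "b", "c"], "var1")
# Nullable integer dtype: rows without any match become <NA> instead of raising.
df1["var1"] = matched.astype("Int64")
print(df1)
```

Here the first row is matched on 'a', the second on 'b', the third on 'c', and the fourth is left missing.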