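I have two DataFrames, F1 and F2, that both contain the columns id1 and id2.
F1 contains 5 columns. F2 contains three columns: [id1, id2, Description]. I want to test whether F1['id1'] exists in F2['id1'] or F1['id2'] exists in F2['id2']; where it does, I need to add a column to F1 holding the Description that F2 gives for that id1 or id2. The contents of F1 and F2, and the output I expect for F1, are shown below. I created F1 and F2 like this:
F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN',223,788,'NaN']}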
F1 = pd.DataFrame(data=F1)
F2 = {'id1': ['x22', 'NaN','NaN','x413','x421'],'id2':['NaN','223','788','NaN','233'],'Description':['California','LA','NY','Havnover','Munich']}
F2 = pd.DataFrame(data=F2)
This is what I tried:
s1 = F2.drop_duplicates('id1').dropna(subset=['id1']).set_index('id1')['Description']
s2 = F2.drop_duplicates('id2').dropna(subset=['id2']).set_index('id2')['Description']
F1['Description'] = F1['id1'].map(s1).combine_first(F1['id2'].map(s2))
How can I correct my code to get this result?
Expected result for F1:
F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN',223,788,'NaN'],'Name':['NNNN','AAAA','XXXX','OOO'],'V1':['oo','li','la','lo'],'Description':['California','LA','NY','Munich']}
F1 = pd.DataFrame(data=F1)
Answer 0 (score: 1)
You can use the isin() function to check whether the ids exist in both DataFrames:
import pandas as pd
F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN', 223, 788,'NaN']}
# Convert every id2 value to a string so it compares equal to the strings in F2
F1['id2'] = [x if isinstance(x, str) else str(x) for x in F1['id2']]
F1 = pd.DataFrame(data=F1)
F2 = {'id1': ['x22', 'NaN','NaN','x413','x421'],'id2':['NaN','223','788','NaN','233'],'Description':['California','LA','NY','Havnover','Munich']}
F2 = pd.DataFrame(data=F2)
F1['Description'] = ''
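# For rows that hold a real id, flag whether that id also appears in the other frame
id1_F1 = (F1[F1['id1']!='NaN']['id1'].isin(F2['id1']))
id1_F2 = (F2[F2['id1']!='NaN']['id1'].isin(F1['id1']))
id2_F1 = (F1[F1['id2']!='NaN']['id2'].isin(F2['id2']))
id2_F2 = (F2[F2['id2']!='NaN']['id2'].isin(F1['id2']))
# Copy the matching descriptions across. The values are assigned by position,
# so this relies on the matches appearing in the same order in F1 and F2
F1.loc[id1_F1[id1_F1].index.values, 'Description'] = F2.loc[id1_F2[id1_F2].index.values, 'Description'].values
F1.loc[id2_F1[id2_F1].index.values, 'Description'] = F2.loc[id2_F2[id2_F2].index.values, 'Description'].values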
Output:
id1 id2 Description
0 x22 NaN California
1 x13 223 LA
2 NaN 788 NY
3 x421 NaN Munich
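The positional assignment works for the sample data because the matching rows happen to appear in the same order in F1 and F2. As a sketch of an order-independent alternative, the lookup can be keyed on the id itself. The helper name build_lookup, the reordered F1 rows, and the choice to prefer id1 over id2 (as the question's combine_first call does) are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# F1 rows are deliberately reordered (hypothetical) to show order independence.
F1 = pd.DataFrame({'id1': ['x421', 'x13', 'NaN', 'x22'],
                   'id2': ['NaN', '223', '788', 'NaN']})
F2 = pd.DataFrame({'id1': ['x22', 'NaN', 'NaN', 'x413', 'x421'],
                   'id2': ['NaN', '223', '788', 'NaN', '233'],
                   'Description': ['California', 'LA', 'NY', 'Havnover', 'Munich']})

def build_lookup(frame, key):
    """Map each real (non-'NaN') key to the first Description listed for it."""
    valid = frame[frame[key] != 'NaN'].drop_duplicates(key)
    return dict(zip(valid[key], valid['Description']))

by_id1 = build_lookup(F2, 'id1')
by_id2 = build_lookup(F2, 'id2')

# Prefer a match on id1, fall back to id2, otherwise leave an empty string.
F1['Description'] = [
    by_id1.get(a, by_id2.get(b, ''))
    for a, b in zip(F1['id1'], F1['id2'])
]
print(F1)
```

Because each description is fetched by key, rows with no match simply stay empty instead of shifting the remaining values.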
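Answer 1 (score: 0)
Your solution works well, but there are problems in the data. First, the NaN values are not missing values but strings, so a replace is necessary. Second, the values in F2['id2'] are string representations of numbers, so add to_numeric with errors='coerce':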
import numpy as np
import pandas as pd
F1 = {'id1': ['x22', 'x13','NaN','x421'],'id2':['NaN',223,788,'NaN']}
F1 = pd.DataFrame(data=F1)
F2 = {'id1': ['x22', 'NaN','NaN','x413','x421'],'id2':['NaN','223','788','NaN','233'],
'Description':['California','LA','NY','Havnover','Munich']}
F2 = pd.DataFrame(data=F2)
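# Solution for the sample data
# Replace the 'NaN' strings with real missing values
F1 = F1.replace('NaN', np.nan)
F2 = F2.replace('NaN', np.nan)
# Convert id2 to numbers; fillna restores any value that could not be converted
F1['id2'] = pd.to_numeric(F1['id2'], errors='coerce').fillna(F1['id2'])
F2['id2'] = pd.to_numeric(F2['id2'], errors='coerce').fillna(F2['id2'])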
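To see why both fixes are needed, here is a small sketch on standalone Series rather than on F1 and F2. The names ids, cleaned, lookup, unmatched and matched are hypothetical, and the list comprehension is just one way to swap the placeholder strings for real missing values.

```python
import numpy as np
import pandas as pd

ids = pd.Series(['NaN', 223, 788, 'NaN'], dtype=object)

# The string 'NaN' is ordinary text, so pandas does not treat it as missing.
print(ids.isna().tolist())        # [False, False, False, False]

# Swap the placeholder strings for real missing values.
cleaned = pd.Series([np.nan if v == 'NaN' else v for v in ids], dtype=object)
print(cleaned.isna().tolist())    # [True, False, False, True]

# The integer 223 is not the same key as the string '223', so nothing matches.
lookup = pd.Series(['LA', 'NY'], index=['223', '788'])
unmatched = cleaned.map(lookup)
print(unmatched.isna().all())

# Once both sides are numeric, the same lookup finds the descriptions.
lookup.index = pd.to_numeric(lookup.index).astype(float)
matched = pd.to_numeric(cleaned, errors='coerce').map(lookup)
print(matched.tolist())
```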
A general solution that replaces the 'NaN' strings and converts values to numeric only in the id columns of both DataFrames:
cols = ['id1','id2']
F1[cols] = F1[cols].replace('NaN', np.nan)
F1[cols] = F1[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(F1[cols])
F2[cols] = F2[cols].replace('NaN', np.nan)
F2[cols] = F2[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(F2[cols])
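Another solution with a custom function:
def func(x):
    # Convert to float where possible (float('NaN') gives a real NaN); otherwise keep the value
    try:
        return float(x)
    except Exception:
        return x
cols = ['id1','id2']
# DataFrame.applymap is deprecated since pandas 2.1; use DataFrame.map there instead
F1[cols] = F1[cols].applymap(func)
F2[cols] = F2[cols].applymap(func)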
print (F1)
id1 id2
0 x22 NaN
1 x13 223.0
2 NaN 788.0
3 x421 NaN
print (F2)
id1 id2 Description
0 x22 NaN California
1 NaN 223.0 LA
2 NaN 788.0 NY
3 x413 NaN Havnover
4 x421 233.0 Munich
s1 = F2.drop_duplicates('id1').dropna(subset=['id1']).set_index('id1')['Description']
s2 = F2.drop_duplicates('id2').dropna(subset=['id2']).set_index('id2')['Description']
F1['Description1'] = F1['id1'].map(s1).combine_first(F1['id2'].map(s2))
print (F1)
id1 id2 Description1
0 x22 NaN California
1 x13 223.0 LA
2 NaN 788.0 NY
3 x421 NaN Munich
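For reuse, the steps above can be wrapped in one function. This is a minimal sketch, not the answerer's code: the names to_key and add_description are hypothetical, F1 is given the Name and V1 columns from the question's expected result, and ids are tried in the order id1, then id2. One caveat is that float() also accepts strings such as 'inf', so ids that look like that would be converted too.

```python
import pandas as pd

def to_key(value):
    """Normalize one id: numeric-looking values become floats, others are kept."""
    try:
        return float(value)   # float('NaN') is a real NaN, float('223') is 223.0
    except (TypeError, ValueError):
        return value          # non-numeric ids such as 'x22' stay unchanged

def add_description(left, right, cols=('id1', 'id2')):
    """Return a copy of left with Description looked up in right, trying cols in order."""
    left, right = left.copy(), right.copy()
    result = None
    for col in cols:
        left[col] = [to_key(v) for v in left[col]]
        right[col] = [to_key(v) for v in right[col]]
        lookup = (right.dropna(subset=[col])
                       .drop_duplicates(col)
                       .set_index(col)['Description'])
        matched = left[col].map(lookup)
        # Earlier columns win; later columns only fill the remaining gaps.
        result = matched if result is None else result.combine_first(matched)
    left['Description'] = result
    return left

F1 = pd.DataFrame({'id1': ['x22', 'x13', 'NaN', 'x421'],
                   'id2': ['NaN', 223, 788, 'NaN'],
                   'Name': ['NNNN', 'AAAA', 'XXXX', 'OOO'],
                   'V1': ['oo', 'li', 'la', 'lo']})
F2 = pd.DataFrame({'id1': ['x22', 'NaN', 'NaN', 'x413', 'x421'],
                   'id2': ['NaN', '223', '788', 'NaN', '233'],
                   'Description': ['California', 'LA', 'NY', 'Havnover', 'Munich']})

out = add_description(F1, F2)
print(out)
```

The function works on copies, so the original F1 and F2 keep their raw values, and the other columns of F1 pass through untouched.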