I am trying to filter a dataset using values collected from another dataset.
For example:
dataset1
Name        Group  Colour  Age  Title  JobRole
John Smith  1      NaN     NaN  NaN    NaN
John Smith  2      NaN     NaN  NaN    NaN
John Smith  3      NaN     NaN  NaN    NaN
James Man   1      NaN     NaN  NaN    NaN
.....
dataset2
Name        Colour  Age  Title  JobRole
John Smith  Red     35   Mr     SuperMan
James Man   Orange  21   Mr     SuperMan
.....
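I want to take each name in dataset1 and use it to filter dataset2. The end goal is to fill the NaN values in dataset1 with the matching data from dataset2.
I am stuck on the filtering step: I have tried a few approaches, but they all produce an empty DataFrame.
What I have tried so far: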
import pandas as pd
import numpy as np
groupDf = pd.read_excel("dataset1.xlsx")
newDf = pd.read_excel("dataset2.xlsx")
for name in newDf['Name']:
    filtered_data = groupDf[groupDf.Name == name]
    print(filtered_data)
Output:
Empty DataFrame
Columns: [Name, Group]
Index: []
Answer 0 (score: 0)
I think you need merge with a left join, followed by combine_first to replace the NaNs:
# left join df2 onto df1; columns that already exist in df1 get a '_' suffix
df = df1.merge(df2, on='Name', how='left', suffixes=('','_'))
# select the suffixed column names added by the merge
new_cols = df.columns[df.columns.str.endswith('_')]
# strip the trailing '_' to recover the original column names
orig_cols = new_cols.str[:-1]
# mapping from suffixed names back to original names
d = dict(zip(new_cols, orig_cols))
# fill NaNs in the original columns with values from the suffixed columns
df[orig_cols] = df[orig_cols].combine_first(df[new_cols].rename(columns=d))
# drop the suffixed helper columns
df = df.drop(new_cols, axis=1)
#print (df)
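To try the same idea without the Excel files, here is a minimal self-contained sketch. It assumes the two frames look like the samples in the question (built inline here instead of with read_excel), and it applies combine_first one column at a time rather than on the whole block of columns. The helper name fill_from_lookup and its parameters are made up for this illustration.

```python
import numpy as np
import pandas as pd


def fill_from_lookup(target, lookup, key="Name"):
    """Fill NaNs in `target` with values from `lookup`, matched on `key`.

    Hypothetical helper: a per-column variant of merge + combine_first.
    """
    # Left join keeps every row of `target`; columns present in both
    # frames get a "_" suffix on the `lookup` side.
    merged = target.merge(lookup, on=key, how="left", suffixes=("", "_"))
    added = [c for c in merged.columns if c.endswith("_")]
    for col in added:
        original = col[:-1]
        # Keep existing values; use the looked-up value only where NaN.
        merged[original] = merged[original].combine_first(merged[col])
    # Remove the suffixed helper columns again.
    return merged.drop(columns=added)


# Sample data copied from the question (stand-ins for the Excel files).
df1 = pd.DataFrame({
    "Name": ["John Smith", "John Smith", "John Smith", "James Man"],
    "Group": [1, 2, 3, 1],
    "Colour": np.nan,
    "Age": np.nan,
    "Title": np.nan,
    "JobRole": np.nan,
})
df2 = pd.DataFrame({
    "Name": ["John Smith", "James Man"],
    "Colour": ["Red", "Orange"],
    "Age": [35, 21],
    "Title": ["Mr", "Mr"],
    "JobRole": ["SuperMan", "SuperMan"],
})

result = fill_from_lookup(df1, df2)
print(result)
```

Note that the left join repeats rows of df1 if a name appears more than once in df2, and names that differ only by stray whitespace or letter case will not match, so the corresponding rows stay NaN.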