我要遍历文件夹中的文件,我想基于名为key的变量合并数据集。这是到目前为止的代码。我有一个数据集可能看起来是什么样的示例/我希望最终结果看起来像什么:
dfs=[]
for f in files:
for name, sheet in sheets_dict.items():
if name=="Main":
data = sheet
dfs.append(data)
dfs的示例:
df1 = {'key': ["A","B"], 'Answer':["yes","No"]}
df1 = pd.DataFrame(data=df1)
df2={'key': ["A","C"], 'Answer':["No","c"]}
df2 = pd.DataFrame(data=df2)
最终输出
final={'A': ["yes","No"], 'B':["No",""],'C':["","c"],'file':['df1','df2']}
final = pd.DataFrame(data=final)
这是我尝试过的方法,但是我无法使它起作用:
df_key={'key': ["A","B","C"]}
df_key = pd.DataFrame(data=df_key)
df_final=[]
for df in dfs:
temp= pd.merge(df_key[['key']],df, on=['key'], how = 'left')
temp_t= temp.transpose()
df_final.append(temp_t)
答案 0 :(得分:1)
重塑和连接数据帧非常简单。但是,为了添加file
值,您将需要a)将数据帧的名称包含在字符串列表中,或者b)在运行时生成新名称。
这是代码
dfs = [df1, df2] # populate dfs as needed
master_df = []
df_key = {'key': ["A","B","C"]}
df_key = pd.DataFrame(df_key) # assuming you already have this dataframe created
master_df.append(pd.Series(index=df_key.columns))
for i, df in enumerate(dfs):
df = df.set_index('key').squeeze()
df.loc['file'] = f'df{i+1}'
master_df.append(df)
# or iterate the dfs alongside their file names
# for fname, df in zip(file_names, dfs):
# df = df.set_index('key').squeeze()
# df.loc['file'] = fname
# master_df.append(df)
master_df = pd.concat(master_df, axis=1).T
# rearrange columns
master_df = master_df[
master_df.columns.difference(['file']).to_list() + ['file']
]
# fill NaNs with empty string
master_df.fillna('', inplace=True)
输出
A B C file
Answer yes No df1
Answer No c df2