Question

I have 4 Excel files: 'a1.xlsx', 'a2.xlsx', 'a3.xlsx', 'a4.xlsx'. All of them have the same format.

For example, a1.xlsx looks like:

id    code    name
1      100    abc
2      200    zxc
...    ...    ...

I have to read these files into pandas dataframes and check whether the same value of the code column exists in more than one Excel file.

Something like this.

Say code=100 exists in 'a1.xlsx' and 'a3.xlsx', while code=200 exists only in 'a1.xlsx'.

The final dataframe should look like this:

code    filename
100   a1.xlsx,a3.xlsx
200   a1.xlsx
...   ....
and so on

I have all the files in one directory and tried iterating over them in a loop:

import pandas as pd
import os
x = next(os.walk('path/to/files/'))[2]  #list all files in directory
os.chdir('path/to/files/')

for i in range (0,len(x)):
    df = pd.read_excel(x[i])

How should I proceed? Any leads?

Answer 1

Use:

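import glob
import os

import pandas as pd

# get all filenames (sorted, because glob returns them in arbitrary order)
files = sorted(glob.glob('path/to/files/*.xlsx'))
# list comprehension that adds a new column holding each file's name;
# the extension is kept so the result reads 'a1.xlsx', as in the question
dfs = [pd.read_excel(fp).assign(filename=os.path.basename(fp)) for fp in files]
# one big df from the list of dfs
df = pd.concat(dfs, ignore_index=True)
# join the filenames of all rows that share the same code
df1 = df.groupby('code')['filename'].apply(', '.join).reset_index()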
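
To see what the groupby step produces without touching the disk, here is a minimal sketch that runs the same idea on in-memory dataframes. The helper name `codes_to_files` and the sample rows are made up for illustration; they stand in for whatever `pd.read_excel` would return for each file. The sketch also drops duplicate codes within a single file first, on the assumption that a file should be listed only once per code.

```python
import pandas as pd


def codes_to_files(frames):
    """Map each code to the comma-separated names of the files containing it.

    `frames` is a dict of {filename: DataFrame}; every frame needs a 'code' column.
    (Hypothetical helper, not part of pandas.)
    """
    tagged = [
        # keep one row per distinct code, then label it with the file name
        frame[["code"]].drop_duplicates().assign(filename=name)
        for name, frame in sorted(frames.items())
    ]
    combined = pd.concat(tagged, ignore_index=True)
    # one row per code, with the file names joined into a single string
    return combined.groupby("code")["filename"].agg(",".join).reset_index()


# Made-up stand-ins for the data read from the Excel files.
frames = {
    "a1.xlsx": pd.DataFrame({"id": [1, 2], "code": [100, 200], "name": ["abc", "zxc"]}),
    "a3.xlsx": pd.DataFrame({"id": [1, 2], "code": [100, 100], "name": ["abc", "abc"]}),
}
result = codes_to_files(frames)
print(result)
```

With real files, the dict could be built as `{os.path.basename(fp): pd.read_excel(fp) for fp in files}`. To keep only the codes that appear in more than one file, one option is to filter on `result["filename"].str.contains(",")`.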
