Question

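I am trying to combine a bunch of xlsx files into a single pandas dataframe in Python. In addition, I want to add a column that lists the source file for each row. My code is as follows: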

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os

# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

# create new dataframe
df = pd.DataFrame()

# read data from files and add into dataframe
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet 1')
    df['Source_file'] = f
    df = df.append(data)

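However, when I look at the 'Source_file' column, it lists the final file read as the name for every row. I have already spent far more time than I should trying to fix this. What am I doing wrong?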

Answer 1

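Inside your for loop, df['Source_file'] = f assigns to the combined df rather than to the data you just read, so each iteration overwrites the column for every row collected so far and you only get the final file name back.

What you need to do is declare a list beforehand and append each file's dataframe to it, then concatenate once at the end.

Since you already imported glob, let's use that as well.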

files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f,sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)
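
If you would rather keep the shape of the original loop, it can also be repaired directly: set the file name on data (the frame just read) and collect the frames in a list. The following is only a minimal sketch; it uses small in-memory frames in place of pd.read_excel so that it runs without any files, and the name frames_by_file, the file names and the values are made up for illustration.

```python
import pandas as pd

# Stand-ins for what pd.read_excel(f, sheet_name='Sheet1') would return.
# The file names and values are hypothetical.
frames_by_file = {
    "a.xlsx": pd.DataFrame({"x": [1, 2]}),
    "b.xlsx": pd.DataFrame({"x": [3]}),
}

dfs = []
for name, data in frames_by_file.items():
    data = data.copy()             # avoid modifying the stand-in frames
    data["Source_file"] = name     # tag the rows just read, not the combined frame
    dfs.append(data)               # list.append, not DataFrame.append

# Concatenate once at the end and renumber the rows 0..n-1.
df = pd.concat(dfs, ignore_index=True)
print(df)
```

Collecting into a list also avoids DataFrame.append, which was deprecated in pandas 1.4 and removed in pandas 2.0.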

If you also want to add the file names to the df:

files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f,sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs,keys=file_names)
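
Note that pd.concat with keys puts the file names in the outer level of a MultiIndex rather than in a regular column. If an actual Source_file column is wanted, one possible way is to name the index levels and reset the outer one. This is a sketch on made-up in-memory frames; the level name "row" and the sample file names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the frames read from each file.
dfs = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]
file_names = ["a.xlsx", "b.xlsx"]

# keys become the outer index level; names labels both index levels.
combined = pd.concat(dfs, keys=file_names, names=["Source_file", "row"])

# Move only the outer level into a regular column ...
df = combined.reset_index(level="Source_file")
# ... and drop the leftover per-file row numbers.
df = df.reset_index(drop=True)
print(df)
```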

Using the pathlib module (recommended, Python 3.4+):

from pathlib import Path
files = list(Path.cwd().glob('*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [f.stem for f in files]
df = pd.concat(dfs,keys=file_names)

Or as a one-liner:

df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],keys=[f.stem for f in Path.cwd().glob('*.xlsx')],sort=False)
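
The one-liner globs the directory twice and relies on both passes returning the files in the same order. A variant that lists the files once and writes the name into a column could look like the sketch below. The helper name combine_files and its reader parameter are hypothetical, added only so the function can be tried without real Excel files; by default it calls pd.read_excel.

```python
from pathlib import Path

import pandas as pd


def combine_files(files, reader=pd.read_excel, **read_kwargs):
    """Read each path and tag its rows with the source file name.

    `reader` is a hypothetical hook for illustration; extra keyword
    arguments such as sheet_name are passed through to it.
    """
    files = list(files)            # Path.glob returns a generator
    if not files:
        return pd.DataFrame()      # pd.concat raises on an empty list
    frames = [
        reader(f, **read_kwargs).assign(Source_file=Path(f).name)
        for f in files
    ]
    return pd.concat(frames, ignore_index=True, sort=False)


# Typical call, assuming the workbooks sit in the current directory:
# df = combine_files(Path.cwd().glob('*.xlsx'), sheet_name='Sheet1')
```

Use Path(f).stem instead of Path(f).name if the extension should be left out, as in the snippets above.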
