Import multiple Excel files and merge them into a single pandas df with the source name as a column

Time: 2019-11-08 21:44:32

Tags: python excel pandas dataframe

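I am trying to merge a bunch of xlsx files into a single pandas dataframe in Python. In addition, I want to add a column that lists the source file for each row. My code is as follows: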

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os

# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

# create new dataframe
df = pd.DataFrame()

# read data from files and add into dataframe
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet 1')
    df['Source_file'] = f
    df = df.append(data)

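However, when I look at the "Source_file" column, it lists the last file read as the name for every row. I have spent a lot of time trying to fix this. What am I doing wrong?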

1 answer:

Answer 0: (score: 0)

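Inside the for loop, df['Source_file'] = f assigns to the whole accumulated df on every iteration, so the rows already appended are relabelled each time and end up showing the final file.

What you need to do is build a list of dataframes first and then concatenate them in one go.

Since you have already imported glob, you may as well use it.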

files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)
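To see the difference between the two patterns without needing any Excel files, here is a minimal sketch. The in-memory frames and the names a.xlsx and b.xlsx are made up for illustration, and pd.concat stands in for DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-ins for the contents of two Excel files.
frames = {
    "a.xlsx": pd.DataFrame({"x": [1, 2]}),
    "b.xlsx": pd.DataFrame({"x": [3]}),
}

# Pattern from the question: the column is set on the accumulated frame,
# so rows that were appended earlier get relabelled on every pass.
buggy = pd.DataFrame()
for name, data in frames.items():
    buggy["Source_file"] = name
    buggy = pd.concat([buggy, data])

# List-then-concat pattern: each frame is labelled on its own
# before anything is combined.
tagged = [data.assign(Source_file=name) for name, data in frames.items()]
fixed = pd.concat(tagged, ignore_index=True)

print(buggy)
print(fixed)
```

In the first result the rows from a.xlsx carry the name of the later file, while in the second each row keeps the name of the frame it came from.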

If you also want to add the file names to the df,

files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs, keys=file_names)
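Note that keys puts the file names into the outer level of a MultiIndex rather than into a regular column. If you want an actual Source_file column as in the question, one option is to name that level and reset it. This is only a sketch: the two small frames and the file names are hypothetical stand-ins for what pd.read_excel would return.

```python
import pandas as pd

# Hypothetical stand-ins for the frames read from two Excel files.
dfs = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]
file_names = ["a.xlsx", "b.xlsx"]

# names labels the outer index level that keys creates.
df = pd.concat(dfs, keys=file_names, names=["Source_file"])

# Move only that level out of the index and into an ordinary column;
# the original row index of each file stays as the index.
flat = df.reset_index(level="Source_file")

print(flat)
```

Keeping the MultiIndex is also reasonable if you mostly want to select by file, for example with df.loc["a.xlsx"].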

Using the pathlib module (recommended, Python 3.4+)

from pathlib import Path
files = list(Path.cwd().glob('*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [f.stem for f in files]
df = pd.concat(dfs, keys=file_names)

Or as a one-liner:

df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],keys=[f.stem for f in Path.cwd().glob('*.xlsx')],sort=False)
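The one-liner globs the directory twice. A possible variation, sketched here under the assumption that you are happy to key by file stem, is to pass pd.concat a dict, since it accepts a mapping and uses the dict keys as the keys. The frames below are in-memory placeholders so the snippet runs without any files.

```python
import pandas as pd

# Hypothetical stand-ins; with real files the dict could be built as
# {f.stem: pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')}
frames_by_stem = {
    "a": pd.DataFrame({"x": [1, 2]}),
    "b": pd.DataFrame({"x": [3]}),
}

# The dict keys become the outer index level, so names and frames
# cannot get out of step with each other.
df = pd.concat(frames_by_stem, sort=False)

print(df)
```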