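I am trying to combine a bunch of xlsx files into a single pandas DataFrame in Python. In addition, I want to add a column that lists the source file for each row. My code is as follows: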
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os
# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# create new dataframe
df = pd.DataFrame()
# read data from files and add into dataframe
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet 1')
    df['Source_file'] = f
    df = df.append(data)
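However, when I look at the "Source_file" column, it lists the last file read as the name for every row. I have spent a lot of time trying to fix this. What am I doing wrong?
Answer 0 (score: 0)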
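Inside your for loop, df['Source_file'] = f overwrites the column for every row already in df on each iteration, so you end up with only the final file name.
What you need to do is build a list of DataFrames first and then concatenate them in one go.
Since you imported glob, you can use it as well.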
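# collect the full paths of all xlsx files in the current directory
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
# read each file once; adjust sheet_name to match your workbooks ('Sheet 1' in the question)
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)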
# to record which file each row came from, pass the file names as keys
files = glob.glob(os.path.join(os.getcwd(), '*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs, keys=file_names)
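Note that keys= puts the file names into an outer index level rather than into a column. If you want an actual Source_file column, one option is to name that level and reset it. Here is a minimal sketch; the two small in-memory frames and the file names first.xlsx and second.xlsx are made up to stand in for the real workbooks, and the level name row is also just an illustrative choice.

```python
import pandas as pd

# Two small in-memory frames stand in for the sheets read from disk.
dfs = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
file_names = ["first.xlsx", "second.xlsx"]  # hypothetical file names

# keys= labels each frame in an outer index level; names= names the levels.
combined = pd.concat(dfs, keys=file_names, names=["Source_file", "row"])

# Move the outer level into a regular column, then discard the per-file row index.
flat = combined.reset_index(level="Source_file").reset_index(drop=True)
print(flat)
```

With real data, dfs and file_names would be the lists built from glob as in the snippet above.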
# the same approach using pathlib
from pathlib import Path
files = list(Path.cwd().glob('*.xlsx'))
dfs = [pd.read_excel(f, sheet_name='Sheet1') for f in files]
# f.stem is the file name without the .xlsx extension
file_names = [f.stem for f in files]
df = pd.concat(dfs, keys=file_names)
Or as a one-liner:
df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],keys=[f.stem for f in Path.cwd().glob('*.xlsx')],sort=False)
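If you would rather keep the column-based layout from the question, another option is to tag each frame with its own file name before concatenating, so the assignment never touches rows from earlier files. This also avoids DataFrame.append, which was removed in pandas 2.0. The sketch below is only an illustration: combine_with_source is a hypothetical helper name, and the in-memory frames replace the pd.read_excel calls so the example runs without any files.

```python
import pandas as pd


def combine_with_source(frames_by_file):
    """Concatenate frames, adding a Source_file column to each one first.

    frames_by_file maps a file name to the DataFrame read from that file.
    (Hypothetical helper, not part of pandas.)
    """
    tagged = [
        frame.assign(Source_file=name)  # tags this frame only, not the running total
        for name, frame in frames_by_file.items()
    ]
    return pd.concat(tagged, ignore_index=True, sort=False)


# In-memory stand-ins; with real files each value would come from
# pd.read_excel(path, sheet_name='Sheet1').
frames = {
    "first.xlsx": pd.DataFrame({"a": [1, 2]}),
    "second.xlsx": pd.DataFrame({"a": [3]}),
}
df = combine_with_source(frames)
print(df)
```

Because assign returns a new frame, each file's rows carry their own name and nothing gets overwritten on later iterations.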