I have a folder full of Excel files. One annoying detail is that they are all .xls (not .xlsx).
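What I need to do is read each .xls file, delete the first 7 rows, then take the rest of the document and add it to a "master.xlsx" file. (Note: master.xlsx does not have to exist beforehand; it can be created fresh.)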
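I haven't started on deleting the rows yet; I'm only trying to merge the files, and I can't figure out how. Do I need to convert all the .xls files to .xlsx first and then merge them? I have spent hours going through other Stack Overflow questions and online resources. This seems to be a rather old technology. It's also worth mentioning that I'm using Python 3.
Here is my code so far: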
import os
from numpy import genfromtxt
import re
import urllib.request
import pandas as pd

# script directory
script_dir = os.path.dirname(r'C:/Users/Kenny/Desktop/pythonReports/')

# get array list of files
files = []
file_abs_path = script_dir + '/excels/'
for file in os.listdir(file_abs_path):
    if file.endswith('.xls'):
        excel_file_path = script_dir + '/excels/' + file
        files.append(excel_file_path)

# f is full file path
df_array = []
writer = pd.ExcelWriter('master.xlsx')
for f in files:
    sheet = pd.read_html(f)
    for n, df in enumerate(sheet):
        df_array.append(df)
        # df = df.append(df)
        # df.to_excel(writer,'sheet%s' % n)
print(df_array)
for df in df_array:
    # new_df = new_df.append(df)
    new_df = pd.concat(df_array)
    new_df.to_excel(writer,'sheet%s' % n)
writer.save()
# print(sheet)
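At some point I stopped getting errors and it was reading and copying the contents correctly, but it rewrites master.xlsx and overwrites the old content instead of appending to it.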
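Edit
The merge is working now. My current difficulty is that I need to take the data from one cell, delete the first 7 rows, then create a new column and put that data into every row of that column (for the length of the document).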
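The step described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `strip_preamble`, the column name `label`, and the cell position (row 0, column 1) are hypothetical, since the question does not say which cell holds the value. It also assumes the sheet was loaded without a header row, so that positions count from the very first line of the file.

```python
import pandas as pd


def strip_preamble(raw, label_row, label_col, skip=7, new_column="label"):
    """Drop the first `skip` rows and copy one preamble cell into a new column."""
    label = raw.iat[label_row, label_col]          # value taken from the preamble
    body = raw.iloc[skip:].reset_index(drop=True)  # everything after the preamble
    body[new_column] = label                       # a scalar is repeated on every row
    return body


# Stand-in for one sheet: 7 preamble rows followed by 2 data rows.
raw = pd.DataFrame([["Report", "north"]] + [["", ""]] * 6 + [["x", 1], ["y", 2]])

# Hypothetical cell position: first row, second column.
clean = strip_preamble(raw, label_row=0, label_col=1)
print(clean)
```

Reading the cell before slicing matters here, because the value lives in the rows that are about to be discarded.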
I think one thing that makes this hard for me is that read_html() won't work for this, so I have to use read_excel(). I get the following error:
Traceback (most recent call last):
  File "script.py", line 83, in <module>
    sheet = pd.read_excel(f)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\excel.py", line 200, in read_excel
    io = ExcelFile(io, engine=engine)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\excel.py", line 257, in __init__
    self.book = xlrd.open_workbook(io)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\book.py", line 1230, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\book.py", line 1224, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\n<html>\n'
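The last line of the traceback shows that the file begins with `<html>`, which suggests these ".xls" files are really HTML tables saved under an Excel extension. That would explain why read_html() could parse them while xlrd rejects them. One possible way to handle a mixed folder is to look at the first bytes of each file and choose the reader accordingly. The sketch below is an assumption-laden illustration: `looks_like_html` and `read_report` are hypothetical helper names, the list of markers is a guess at what such exports start with, read_html() needs an HTML parser such as lxml installed, and read_excel() needs xlrd for genuine .xls files.

```python
def looks_like_html(path, probe=512):
    """Return True if the file starts with HTML markup rather than binary Excel data."""
    with open(path, "rb") as handle:
        # Ignore leading whitespace and letter case before comparing.
        head = handle.read(probe).lstrip().lower()
    return head.startswith((b"<html", b"<!doctype html", b"<table"))


def read_report(path):
    """Pick a pandas reader based on the file's contents, not its extension."""
    import pandas as pd  # imported here so the sniffing helper has no dependencies

    if looks_like_html(path):
        return pd.read_html(path)   # a list of DataFrames, one per <table>
    return [pd.read_excel(path)]    # wrapped in a list for a uniform return type
```

Returning a list in both branches lets the calling loop treat every file the same way, whichever reader was used.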
Answer 0 (score: 1)
Here is my final merge solution (it also has a nice little dynamically updating print statement):
# Merge all .xlsx files into one 'master.xlsx'
files = get_files('/xlsx/', '.xlsx')
df_array = []
all_data = pd.DataFrame()
writer = pd.ExcelWriter('master.xlsx')
for i, f in enumerate(files, start=1):
    sheet = pd.read_excel(f)
    all_data = all_data.append(sheet, ignore_index=True)
    # progress of entire list
    if i <= len(files):
        print('\r{:*^7}{:.0f}%'.format('Merging: ', i/len(files)*100), end='')
all_data.to_excel(writer, 'sheet')
writer.save()
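A note for readers on newer pandas: `DataFrame.append` and `ExcelWriter.save()` were removed in pandas 2.0, so the code above only runs on older versions. A rough equivalent using `pd.concat` and a context-managed writer might look like the sketch below. The function names `merge_frames` and `merge_workbooks` are hypothetical, and `get_files` from the answer is the author's own helper, which is not shown and is not reproduced here. Writing .xlsx output also assumes an engine such as openpyxl is installed.

```python
import pandas as pd


def merge_frames(frames):
    """Concatenate an iterable of DataFrames into one with a fresh 0..n-1 index."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)


def merge_workbooks(paths, out_path="master.xlsx", skiprows=0):
    """Read each workbook, stack the results, and write them out once."""
    merged = merge_frames(pd.read_excel(p, skiprows=skiprows) for p in paths)
    # The context manager saves and closes the file when the block exits.
    with pd.ExcelWriter(out_path) as writer:
        merged.to_excel(writer, sheet_name="sheet")
    return merged
```

Calling `merge_workbooks(files, skiprows=7)` would skip the seven leading rows of every file, with the eighth row then taken as the header. Collecting the frames and concatenating once also avoids copying the growing table on every loop iteration.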