Merging .xls files with Python

Time: 2017-07-11 16:17:07

Tags: python excel python-3.x pandas

I have a folder full of Excel files. One annoying aspect is that they are all .xls (rather than .xlsx).

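What I need to do is read each .xls file, remove the first 7 rows, then take the rest of the document and add it to a "master.xlsx" file. (Note: master.xlsx does not have to exist beforehand; it can be newly created.)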

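I haven't started trying to remove the rows yet; I'm just trying to merge the files, and I can't figure out how. Do I need to convert all the .xls files to .xlsx first and then try to merge them? I've spent hours looking through other Stack Overflow questions and online resources. This seems to be some kind of ancient technology. Also worth mentioning: I'm using Python 3.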

Here is my code so far:

import os
from numpy import genfromtxt
import re
import urllib.request
import pandas as pd


# script directory
script_dir = os.path.dirname(r'C:/Users/Kenny/Desktop/pythonReports/')


# get array list of files
files = []
file_abs_path = script_dir + '/excels/'
for file in os.listdir(file_abs_path):
    if file.endswith('.xls'):
        excel_file_path = script_dir + '/excels/' + file
        files.append(excel_file_path)

# f is full file path
df_array = []
writer = pd.ExcelWriter('master.xlsx')
for f in files:
    sheet = pd.read_html(f)

    for n, df in enumerate(sheet):
        df_array.append(df)
        # df = df.append(df)
    # df.to_excel(writer,'sheet%s' % n)
print(df_array)

for df in df_array:
        # new_df = new_df.append(df)
        new_df = pd.concat(df_array)
        new_df.to_excel(writer,'sheet%s' % n)
        writer.save()
    # print(sheet)

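At some point I stopped getting errors and it was reading and copying the content correctly, but it rewrites master.xlsx and overwrites the old content instead of concatenating it.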
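One possible reading of the overwrite: in the final loop above, every iteration concatenates the whole list and writes it to the same sheet name, because `n` still holds the value left over from the earlier loop. A minimal sketch of the alternative, assuming the per-file results are kept as the lists that `pd.read_html()` returns (`merge_tables` is a hypothetical helper name, not part of pandas):

```python
import pandas as pd


def merge_tables(tables_per_file):
    """Flatten per-file lists of DataFrames and stack them into one frame."""
    # pd.read_html() returns a list of DataFrames per file, so flatten first.
    frames = [df for tables in tables_per_file for df in tables]
    # Concatenate once, after every file has been read, and renumber the rows.
    return pd.concat(frames, ignore_index=True)
```

The merged frame can then be written a single time after the loop, for example with `merged.to_excel('master.xlsx', index=False)`, so nothing is written over on later iterations.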

Edit

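The merge is working now. My current difficulty is that I need to get data from a cell, delete the first 7 rows, and then create a new column and add that data to every row in that column (for the length of the document).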
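The three steps described here (read one cell, drop the first 7 rows, fill a new column with the cell's value) could be sketched as follows. This assumes the file has already been loaded into a DataFrame with no header row applied; the function name, the cell position arguments, and the column name `source_value` are all hypothetical.

```python
import pandas as pd


def lift_cell_to_column(raw, cell_row, cell_col, header_rows=7,
                        new_column='source_value'):
    """Drop the leading rows and copy one of their cells into a new column."""
    # Read the cell before the leading rows are discarded.
    value = raw.iat[cell_row, cell_col]
    # Keep everything after the leading block and renumber the rows from 0.
    body = raw.iloc[header_rows:].reset_index(drop=True)
    # Assigning a scalar fills every row of the new column.
    body[new_column] = value
    return body
```

Calling `lift_cell_to_column(df, 2, 1)` would take the value in the third row, second column, and attach it to every remaining row.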

I think part of what makes this hard is that read_html() does not work here, so I have to use read_excel(). That gives me the following error:

Traceback (most recent call last):
  File "script.py", line 83, in <module>
    sheet = pd.read_excel(f)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\excel.py", line 200, in read_excel
    io = ExcelFile(io, engine=engine)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\excel.py", line 257, in __init__
    self.book = xlrd.open_workbook(io)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\book.py", line 1230, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "C:\Users\Kenny\AppData\Local\Programs\Python\Python36-32\lib\site-packages\xlrd\book.py", line 1224, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\n<html>\n'
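The `b'\n<html>\n'` at the end of the traceback suggests that these files may be HTML tables saved with an .xls extension rather than binary Excel workbooks, which would also explain why read_html() could parse them earlier. A minimal sketch for checking this by looking at the first bytes of a file (`sniff_format` is a hypothetical helper, and the HTML check is only a heuristic):

```python
def sniff_format(path):
    """Guess the real format of a spreadsheet-like file from its first bytes."""
    with open(path, 'rb') as fh:
        head = fh.read(512)
    if head.startswith(b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'):
        return 'xls'    # OLE2 compound file, the container of binary .xls
    if head.startswith(b'PK\x03\x04'):
        return 'xlsx'   # ZIP archive, the container of .xlsx
    lowered = head.lower()
    if b'<html' in lowered or b'<table' in lowered:
        return 'html'   # HTML markup behind a spreadsheet extension
    return 'unknown'
```

If a file comes back as `'html'`, `pd.read_html(path)` is the likelier reader (it returns a list of DataFrames); for `'xls'` or `'xlsx'`, `pd.read_excel(path)` would be the one to try.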

1 answer:

Answer 0 (score: 1)

Here is my final merge solution (it also has a nice little dynamically updating print statement):

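# Merge all .xlsx files into one 'master.xlsx'
import os
import pandas as pd

# Assumption: same base directory as in the question
script_dir = os.path.dirname(r'C:/Users/Kenny/Desktop/pythonReports/')


def get_files(subfolder, extension):
    # Assumed helper (its body was not shown in the original answer):
    # full paths of the files in script_dir + subfolder ending with extension.
    folder = script_dir + subfolder
    return [folder + name for name in sorted(os.listdir(folder))
            if name.endswith(extension)]


files = get_files('/xlsx/', '.xlsx')
frames = []

for i, f in enumerate(files, start=1):
    frames.append(pd.read_excel(f))

    # progress of entire list, rewritten in place via '\r'
    print('\r{:*^7}{:.0f}%'.format('Merging: ', i / len(files) * 100), end='')

# DataFrame.append() was removed in pandas 2.0, so concatenate once instead
all_data = pd.concat(frames, ignore_index=True)

# the context manager saves the file on exit (writer.save() was also removed)
with pd.ExcelWriter('master.xlsx') as writer:
    all_data.to_excel(writer, sheet_name='sheet')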