Reading messy Excel files with Python

Posted: 2017-01-04 17:51:40

Tags: python excel pandas data-analysis

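I have been searching Stack Exchange and similar sites for a solution to this problem, and so far I haven't found one.

I'm sure someone has run into this before: I'm writing a Python script that will extract and reshape some data from an Excel file. The problem is that the Excel file is riddled with irregular formatting and extraneous data. So, before I can get to the table I need: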

Table I need

I have to get past tables like this one:

Table I don't need

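My plan was to use some kind of regex or string matching to work out where to split the file, so that I end up with only the part I need. But the problem I'm hitting now is that pandas freaks out whenever I try to run read_excel on this file.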
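For reference, the splitting step I have in mind could look roughly like the sketch below. It assumes the sheet has already been loaded as a list of rows (each row a list of cell values). The helper names `find_table_start` and `extract_table`, and the idea that the wanted table ends at the first completely empty row, are my own assumptions for illustration and not anything pandas provides.

```python
import re


def find_table_start(rows, pattern):
    """Return the index of the first row with a cell matching pattern, or None."""
    regex = re.compile(pattern)
    for i, row in enumerate(rows):
        for cell in row:
            if cell is not None and regex.search(str(cell)):
                return i
    return None


def extract_table(rows, pattern):
    """Return rows from the matching header row up to the next fully empty row."""
    start = find_table_start(rows, pattern)
    if start is None:
        return []
    table = []
    for row in rows[start:]:
        # Assumption: a row whose cells are all None or blank ends the table.
        if all(cell is None or str(cell).strip() == "" for cell in row):
            break
        table.append(row)
    return table
```

With a header cell that is unique to the wanted table (a hypothetical "Region" column, say), `extract_table(rows, r"^Region$")` would skip the unwanted tables above it and stop at the blank row below it.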

In [4]: df = pd.read_excel(open('data.xlsx','rb'), sheetname=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-21f5fee2b08d> in <module>()
----> 1 df = pd.read_excel(open('data.xlsx','rb'), sheetname=0)

/Users/Gus/anaconda2/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, squeeze, **kwds)
    168     """
    169     if not isinstance(io, ExcelFile):
--> 170         io = ExcelFile(io, engine=engine)
    171 
    172     return io._parse_excel(

/Users/Gus/anaconda2/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    223             # N.B. xlrd.Book has a read attribute too
    224             data = io.read()
--> 225             self.book = xlrd.open_workbook(file_contents=data)
    226         elif isinstance(io, compat.string_types):
    227             self.book = xlrd.open_workbook(io)

/Users/Gus/anaconda2/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/Users/Gus/anaconda2/lib/python2.7/site-packages/xlrd/xlsx.pyc in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/Users/Gus/anaconda2/lib/python2.7/site-packages/xlrd/xlsx.pyc in own_process_stream(self, stream, heading)
    551                 self.do_dimension(elem)
    552             elif elem.tag == U_SSML12 + "mergeCell":
--> 553                 self.do_merge_cell(elem)
    554         self.finish_off()
    555 

/Users/Gus/anaconda2/lib/python2.7/site-packages/xlrd/xlsx.pyc in do_merge_cell(self, elem)
    607         ref = elem.get('ref')
    608         if ref:
--> 609             first_cell_ref, last_cell_ref = ref.split(':')
    610             first_rowx, first_colx = cell_name_to_rowx_colx(first_cell_ref)
    611             last_rowx, last_colx = cell_name_to_rowx_colx(last_cell_ref)

ValueError: need more than 1 value to unpack
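Reading the traceback, the failing line is `first_cell_ref, last_cell_ref = ref.split(':')` in xlrd's `do_merge_cell`, so at least one `mergeCell` element in the sheet XML seems to have a `ref` with no colon (a single cell such as `A1`, perhaps; the actual value isn't shown). One possible workaround, sketched below using only the standard library, is to copy the .xlsx archive and drop those single-cell `mergeCell` entries before handing the file to a reader. The function name `strip_single_cell_merges` is hypothetical, the regex assumes unprefixed, double-quoted, self-closing `<mergeCell ref="..."/>` elements, and it leaves the `count` attribute of `<mergeCells>` untouched, so treat it as a sketch rather than a verified fix.

```python
import io
import re
import zipfile

# Matches a self-closing <mergeCell> whose ref has no colon, e.g. ref="A1".
# Assumption: no namespace prefix and double-quoted attribute values.
SINGLE_CELL_MERGE = re.compile(br'<mergeCell\s+ref="[^":]*"\s*/>')


def strip_single_cell_merges(xlsx_bytes):
    """Return a copy of the .xlsx bytes without single-cell mergeCell entries."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as src:
        with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                # Only worksheet parts carry mergeCell elements.
                if (item.filename.startswith("xl/worksheets/")
                        and item.filename.endswith(".xml")):
                    data = SINGLE_CELL_MERGE.sub(b"", data)
                dst.writestr(item, data)
    return out_buf.getvalue()
```

The cleaned bytes could then be wrapped in `io.BytesIO` and passed to `pd.read_excel` in place of the open file. I have not confirmed that this is enough for this particular workbook.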

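The whole point of writing this program is that I currently have to go into every one of these files and delete information by hand. But if Python won't even accept the file, how can I automate the process? I'm hoping someone has run into a similar problem before. What was your solution?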

0 answers:

No answers