XML to XLSX in Python

Time: 2017-02-11 22:20:59

Tags: xml xls

I have searched high and low for an answer, but there does not seem to be a clear solution. Here is the situation:

from selenium import webdriver

chromedriver_path = ("localchromedrive/chromedriver.exe")
chromeOptions = webdriver.ChromeOptions()
MSCI_dir = ("mylocaldrive")
prefs = {"download.default_directory" : MSCI_dir}
chromeOptions.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chromedriver_path,chrome_options=chromeOptions)
url = "https://www.ishares.com/us/239637/fund-download.dl"
driver.get(url)

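The file is now downloaded to the local path and saved as follows:

temp_path = r"mylocaldrive\iShares-MSCI-Emerging-Markets-ETF_fund.xls"  # raw string, so "\i" is not read as an escape

The file is saved with an ".xls" extension, but it is clearly an XML file, as is apparent when it is opened in Notepad.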

I tried xlrd:

import xlrd
book = xlrd.open_workbook(temp_path)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
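
The bytes quoted in the error (a UTF-8 byte-order mark followed by `<?xml`) can be checked directly. The following is a minimal sketch, not part of xlrd or pandas, that guesses a file's real format from its leading bytes. The function names are hypothetical, and it only recognizes the three cases relevant here: the OLE2 container used by binary .xls files, the ZIP container used by .xlsx files, and plain XML.

```python
# Well-known leading bytes for the formats in question.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx files are ZIP archives
UTF8_BOM = b"\xef\xbb\xbf"


def classify_header(head):
    """Classify the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls (OLE2 binary)"
    if head.startswith(ZIP_MAGIC):
        return "xlsx (ZIP container)"
    # Drop an optional UTF-8 byte-order mark, then any leading whitespace.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml (plain text)"
    return "unknown"


def sniff_spreadsheet_format(path):
    """Read only the first 64 bytes of the file and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(64))
```

Calling `sniff_spreadsheet_format(temp_path)` on a file that starts with the bytes shown in the error message would report plain XML, which is consistent with xlrd refusing to open it.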

I have tried xml.etree:

import xml.etree.ElementTree as ET
tree = ET.parse(temp_path)
File "<string>", line unknown
ParseError: mismatched tag: line 16, column 2
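
A "mismatched tag" error means the parser found a closing tag that does not match the element currently open, so the file is not well-formed XML at that point. As a hedged sketch for inspecting the problem, the helper below (a hypothetical name, not part of ElementTree) catches the `ParseError` and returns the reported position together with the surrounding lines of text.

```python
import xml.etree.ElementTree as ET


def describe_parse_error(xml_text, context=1):
    """Return (line, column, nearby_lines) for the first XML error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        lines = xml_text.splitlines()
        start = max(line - 1 - context, 0)
        # Slice out the offending line plus `context` lines on either side.
        return line, column, lines[start:line + context]
    return None
```

To apply this to the downloaded file, the text could be read with `open(temp_path, encoding="utf-8-sig").read()`, assuming the file is UTF-8 as its byte-order mark suggests; the `utf-8-sig` codec drops the mark before parsing.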

I tried xlwings:

import xlwings as xw
wb = xw.Book(temp_path)
wb.save(xlsx_path)
wb.close()

This appeared to work, but when I then tried pandas, I got this:

pd.read_excel(xlsx_path)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

I tried BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(temp_path), "xml")

In [1]: soup
Out[1]: <?xml version="1.0" encoding="utf-8"?>

In [2]: soup.contents
Out[2]: []

In [3]: soup.get_text()
Out[3]: ''

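I am looking for a definitive way to access this file with pandas. Let me know if you need any information I have left out.

1 Answer:

Answer 0: (score: 0)

I think your problem is that the file is not an XLS file but an XLSX file, which is a special XML-based format that Microsoft created to reduce the size of DOC and XLS files.

See: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats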

https://msdn.microsoft.com/en-us/library/dd922181(v=office.12).aspx
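
One note on the format: an .xlsx file is a ZIP archive, whereas the error messages in the question show a file that begins with `<?xml`, which points to a plain-XML spreadsheet such as the Excel 2003 XML format described on the Wikipedia page linked above. The sketch below rests on that assumption and on the file being well-formed, which the `ParseError` in the question suggests is not yet the case, so the markup would need repairing first. The function name is hypothetical, and skipped cells (the `ss:Index` attribute) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Excel 2003 XML Spreadsheet format.
SS = "{urn:schemas-microsoft-com:office:spreadsheet}"


def spreadsheetml_rows(xml_text):
    """Return {sheet name: list of rows}, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    sheets = {}
    for sheet in root.iter(SS + "Worksheet"):
        rows = []
        for row in sheet.iter(SS + "Row"):
            cells = []
            for cell in row.findall(SS + "Cell"):
                data = cell.find(SS + "Data")
                # An empty cell has no Data child; record it as "".
                cells.append("".join(data.itertext()) if data is not None else "")
            rows.append(cells)
        sheets[sheet.get(SS + "Name")] = rows
    return sheets
```

With rows in hand, a table whose first row is a header could be passed to pandas as `pd.DataFrame(rows[1:], columns=rows[0])`. All values come back as strings, so numeric columns would still need converting.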