Question

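I am reading some data files that are stored online as Excel documents. My current process downloads each file to disk with the retrieveFile function defined below, which uses the urllib2 library, and then parses the Excel document with the traverseWorkbook function. The traversal function uses the xlrd library to parse the Excel file.

I would like to do the same thing without downloading the file to disk: I would prefer to keep the file in memory and parse it from there.

I am not sure how to proceed, but I am sure it is possible.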

import urllib2

def retrieveFile(url, filename):
    try:
        req = urllib2.urlopen(url)
        CHUNK = 16 * 1024
        with open(filename, 'wb') as fp:
            # Copy the response to disk in 16 KiB chunks.
            while True:
                chunk = req.read(CHUNK)
                if not chunk:
                    break
                fp.write(chunk)
        return True
    except Exception:
        return None


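from xlrd import open_workbook

def traverseWorkbook(filename):
    values = []

    wb = open_workbook(filename)
    for s in wb.sheets():
        for row in range(s.nrows):
            # Skip the first 11 rows (indices 0-10).
            if row > 10:
                # processRow is the asker's own helper (not shown).
                rowData = processRow(s, row, type)
                if rowData:
                    values.append(rowData)
    return values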

Answer 1

You can read the entire file into memory with:

data = urllib2.urlopen(url).read()

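Once the file is in memory, you can load it into xlrd using the file_contents argument of open_workbook:

wb = xlrd.open_workbook(url, file_contents=data)

Pass the url as the filename, since the documentation says it may be used in messages; otherwise it is ignored.

So your traverseWorkbook function can be rewritten to take the URL, read the bytes with urllib2.urlopen(url).read(), and pass them to open_workbook as file_contents instead of opening a file on disk.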
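A minimal sketch of what the rewritten traverseWorkbook might look like, assuming the bytes have already been downloaded. The names traverse_workbook, first_row and the injectable open_workbook parameter are illustrative and not from the question, and sheet.row_values stands in for the asker's processRow helper, which is not shown.

```python
def traverse_workbook(url, data, first_row=11, open_workbook=None):
    """Collect rows from every sheet of a workbook held in memory.

    `data` is the raw bytes of the downloaded file; `url` is passed to
    xlrd only as a label it may use in messages.
    """
    if open_workbook is None:
        # Imported lazily so the function can be tried with a stand-in opener.
        from xlrd import open_workbook

    wb = open_workbook(url, file_contents=data)
    values = []
    for sheet in wb.sheets():
        # The question skips row indices 0-10 (`if row > 10`).
        for row in range(first_row, sheet.nrows):
            # Assumption: plain cell values replace the asker's processRow.
            row_data = sheet.row_values(row)
            if row_data:
                values.append(row_data)
    return values
```

With the data from the snippet above, this would be called as traverse_workbook(url, data); nothing is written to disk at any point.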

Answer 2

You can use the StringIO library and write the downloaded data to a file-like StringIO object instead of a regular file.

import urllib2
import cStringIO as cs
from contextlib import closing

def retrieveFile(url):
    try:
        req = urllib2.urlopen(url)
        CHUNK = 16 * 1024
        full_str = None
        with closing(cs.StringIO()) as fp:
            while True:
                chunk = req.read(CHUNK)
                if not chunk:
                    break
                fp.write(chunk)
            full_str = fp.getvalue()  # This contains the full contents of the downloaded file.
        # Return the data itself; returning True would discard it.
        return full_str
    except Exception:
        return None
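For readers on Python 3, where io.BytesIO and urllib.request take the place of cStringIO and urllib2, the same chunked download could be sketched roughly as follows. The function names read_into_memory and retrieve_file are illustrative, not part of the answer.

```python
import io
from urllib.request import urlopen

CHUNK = 16 * 1024


def read_into_memory(response, chunk_size=CHUNK):
    """Copy a file-like response into an in-memory buffer, chunk by chunk."""
    buffer = io.BytesIO()
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so a parser can read from the start
    return buffer


def retrieve_file(url):
    """Download `url` into a BytesIO; return None on network or HTTP errors."""
    try:
        with urlopen(url) as response:
            return read_into_memory(response)
    except OSError:  # urllib's URLError and HTTPError derive from OSError
        return None
```

The buffer's getvalue() gives the raw bytes, which can then be handed to xlrd.open_workbook through file_contents as described in Answer 1.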

Answer 3

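You can use pandas. Its advantage is that it handles in-memory data efficiently, because much of the computation is done in C rather than in pure Python. It also abstracts away many of the messy details of downloading the data.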

import pandas as pd

xl = pd.ExcelFile(url, engine='xlrd')
sheets = xl.sheet_names

# Work with the first sheet, or iterate through the sheets if there is more than one.
df = xl.parse(sheets[0])

# The file is now a dataframe.
# You can manipulate the data in memory using the Pandas API
# ...
# ...

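# After massaging the data, write it to an xls file.
# (Recent pandas releases no longer write .xls or accept `encoding`;
# there, write to .xlsx and drop that argument.)
out_file = '~/Documents/out_file.xls'
df.to_excel(out_file, encoding='utf-8', index=False)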

Processing a file in memory with Python
