Question

I am using numpy.fromfile to construct an array that I can pass to the pandas.DataFrame constructor:

import numpy as np
import pandas as pd

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })
    return pd.DataFrame(np.fromfile(file, dt))

I would like to extend this function to work with gzipped files.

According to the numpy.fromfile documentation, the first parameter is file:

file : file or str
Open file object or filename

So I added the following to check for a gzip file path:

if isinstance(file, str) and file.endswith(".gz"):
    file = gzip.open(file, "r")

However, when I pass the resulting object to fromfile, I get an IOError:

IOError: first argument must be an open file

Question:

How can I call numpy.fromfile with a gzipped file?

Edit

As requested in the comments, here is the implementation that checks for gzipped files:

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })

    if isinstance(file, str) and file.endswith(".gz"):
        file = gzip.open(file, "r")

    return pd.DataFrame(np.fromfile(file, dt))

Answer 1

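gzip.open() does not return a true file object. It is a duck-typed one: it walks like a duck and sounds like a duck, but as far as numpy is concerned it is not quite a duck. numpy is being strict here (since much of it is written in lower-level C code, it probably requires an actual file descriptor).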

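You can get the underlying file from the gzip.open() call, but that only gives you the compressed stream.

Here is what I would do: use subprocess.Popen() to invoke zcat and decompress the file as a stream.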
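
To illustrate the point about the compressed stream, here is a small sketch (the file name and payload are made up for the example). Reading the file on disk directly yields gzip-encoded bytes, which begin with the gzip magic number 0x1f 0x8b, while reading through gzip.open() yields the original data:

```python
import gzip
import os
import tempfile

payload = b"hello world\n"  # example data, not from the question

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "foo.txt.gz")  # hypothetical file name

    # Write a small gzip file to experiment with.
    with gzip.open(path, "wb") as f:
        f.write(payload)

    # Reading the file on disk directly gives the compressed bytes.
    with open(path, "rb") as raw:
        compressed = raw.read()

    # Reading through gzip.open() gives the decompressed bytes.
    with gzip.open(path, "rb") as gz:
        decompressed = gz.read()

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
has_gzip_magic = compressed[:2] == b"\x1f\x8b"
```

So an array built from the raw file would contain compressed data rather than the records you expect.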

>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'

Now you can pass p.stdout to numpy as a file object:

np.fromfile(p.stdout, ...)
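
If zcat is not available, a different route to the same goal (giving numpy a real file) is to decompress into a temporary file first. This is a sketch of that alternative, not the method from this answer; fromfile_gz is a hypothetical helper name, and it assumes the decompressed data fits on local temporary storage:

```python
import gzip
import shutil
import tempfile

import numpy as np


def fromfile_gz(path, dtype):
    """Hypothetical helper: decompress `path` into a real temporary file,
    then let np.fromfile read from that file."""
    with gzip.open(path, "rb") as src, tempfile.TemporaryFile() as tmp:
        # Stream the decompressed bytes into the temporary file.
        shutil.copyfileobj(src, tmp)
        tmp.flush()
        # Rewind so np.fromfile starts reading at the beginning.
        tmp.seek(0)
        # np.fromfile copies the data into memory, so the array remains
        # valid after the temporary file is closed.
        return np.fromfile(tmp, dtype=dtype)
```

The cost is one extra copy of the decompressed data on disk, in exchange for not depending on an external program.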

Answer 2

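I have had success reading arrays of raw binary data from gzipped files by passing the result of read() to numpy.frombuffer(). This code works in Python 3.7.3, and perhaps in earlier versions as well.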
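
A minimal sketch of the read() plus numpy.frombuffer() approach, applied to the record layout from the question. This is not the answerer's original code; BEST_DTYPE and load_best_records are names introduced here for illustration, and the whole decompressed file is assumed to fit in memory:

```python
import gzip

import numpy as np

# Record layout copied from the question.
BEST_DTYPE = np.dtype({
    'names':   ['time', 'bid_size', 'bid_price', 'ask_size', 'ask_price'],
    'formats': ['u8',   'i4',       'f8',        'i4',       'f8'],
    'offsets': [0,      8,          12,          20,         24],
})


def load_best_records(path):
    """Hypothetical helper: load records from a plain or gzipped file."""
    if path.endswith(".gz"):
        # Decompress everything into memory, then view the bytes as records.
        with gzip.open(path, "rb") as f:
            raw = f.read()
        # frombuffer returns a read-only view over `raw`;
        # call .copy() on the result if you need to modify it.
        return np.frombuffer(raw, dtype=BEST_DTYPE)
    # Uncompressed files can still go through np.fromfile.
    return np.fromfile(path, dtype=BEST_DTYPE)
```

The returned structured array should be usable with pd.DataFrame in the same way as the np.fromfile result in the question.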


numpy: fromfile for gzipped file
