Question

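My script uses pandas.read_csv to read a CSV file directly into a DataFrame. The user is expected to provide a configuration file listing the column labels to keep, along with their data types. Let me illustrate with an example. Here is a CSV file.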

factor,value1,value2,value3
a,1,2.0,1
b,3,4.1,2
c,5,6.2,3

The configuration file would be

factor char
value1 integer
value2 float

I have a function that converts the configuration into a dictionary

col_types = read_types(config)  # -> {"factor": str, "value1": numpy.int32, "value2": numpy.float64}
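For context, here is a minimal sketch of what read_types could look like. It is an illustration, not my exact code: it assumes config is any iterable of "label type" lines (for example an open file), and it maps the type names to dtype strings that pandas also accepts ("int32", "float64") so the snippet has no dependencies. Swapping in numpy.int32 and numpy.float64 gives the dictionary shown above.

```python
def read_types(config):
    """Turn 'label type' lines into a {label: dtype} dictionary.

    Assumption: config is an iterable of text lines, e.g. an open file.
    """
    # Hypothetical mapping from config type names to pandas-compatible dtypes.
    type_map = {"char": str, "integer": "int32", "float": "float64"}
    col_types = {}
    for line in config:
        if not line.strip():
            continue  # skip blank lines
        label, type_name = line.split()
        col_types[label] = type_map[type_name]
    return col_types


col_types = read_types(["factor char", "value1 integer", "value2 float"])
```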

Then I read the CSV

df = pandas.read_csv(csv_file, header=0, sep=",", index_col=False, dtype=col_types)

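Sometimes the labels listed in the configuration file don't match the labels in the CSV file in letter case, e.g. it can be value1 in the configuration file and Value1 in the CSV. So I would like to get a file-like stream that automatically converts each line to lowercase. I tried this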

def read_df(...):
    with open(csv_file) as csv:
        lower_lines = (line.lower() for line in csv)
        return pandas.read_csv(lower_lines, header=0, sep=",", 
                               index_col=False, dtype=col_types)

This fails because pandas.read_csv expects a file path or a file-like object, not a generator.

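Then I tried DataFrame.from_records, which can read from a generator, but it has no equivalent of the dtype parameter of pandas.read_csv. In the end I mimicked a file with my own class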

class GeneratorWrapper(object):
    def __init__(self, generator):
        """
        :type generator: Generator[str]
        """
        self._generator = generator

    def read(self, n=0):
        return next(self._generator, "")

def read_df(...):
    with open(csv_file) as csv:
        lower_lines = GeneratorWrapper((line.lower() for line in csv))
        return pandas.read_csv(lower_lines, header=0, sep=",", 
                               index_col=False, dtype=col_types)
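To make the wrapper's behavior concrete without involving pandas, here is a small self-contained check. The class is repeated so the snippet runs on its own, and io.StringIO stands in for the opened CSV file (the sample content is made up for illustration). Note that read ignores the requested size n and hands back exactly one lowercased line per call, then an empty string once the generator is exhausted.

```python
import io


class GeneratorWrapper(object):
    """Same class as above, repeated so this snippet is self-contained."""

    def __init__(self, generator):
        self._generator = generator

    def read(self, n=0):
        # n is ignored: each call returns one line, or "" when exhausted.
        return next(self._generator, "")


# io.StringIO is a stand-in for the file returned by open(csv_file).
sample = io.StringIO("Factor,Value1,Value2\na,1,2.0\n")
stream = GeneratorWrapper(line.lower() for line in sample)

# Drain the stream the way a consumer calling read(n) repeatedly would.
chunks = []
while True:
    chunk = stream.read(1024)
    if not chunk:
        break
    chunks.append(chunk)
```

Only one line is held in memory at a time, which is the property I want to keep.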

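This works, but I think it is overkill. I believe there should be a more Pythonic way to get a preprocessed file-like stream. In short, is there a better way to get a lazily processed file-like stream in Python?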

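Note.

I know I could read all the lines, process them, and pass them to pandas.DataFrame, but I don't want to use any intermediate Python container, because the files are large and I don't want to run into memory errors.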

Preprocessed FileIO generator in Python
