Question

I have this method:

def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores # gives truncated integer

    f = open(path)
    while 1:
        start = f.tell()
        f.seek(chunksize, 1) # Go to the next chunk
        s = f.readline() # Ensure the chunk ends at the end of a line
        yield start, f.tell()-start
        if not s:
            break

It is supposed to split the file into chunks and return each chunk's start offset (in bytes) and its size.

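Crucially, the end of a chunk should coincide with the end of a line (which is why the f.readline() call is there), but I'm finding that my chunks don't seek to an EOL at all.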

The purpose of the method is to read chunks that can then be passed to a csv.reader instance (via StringIO) for further processing.

I haven't been able to spot anything obviously wrong with the function... any ideas why it isn't moving to the EOL?

I came up with this rather clunky alternative:

def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores # gives truncated integer

    f = open(path)

    while True:
        part = f.readlines(chunksize)
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break

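This splits the file into chunks, each with its own csv reader, but the last chunk is always empty (??), and having to join the list of strings back together is rather clunky.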

Answer 1

if not s:
    break

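Rather than looking at s to decide whether you are at the end of the file, check whether you have reached the end of the file with the following: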

if f.tell() >= size: break
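
The end-of-file check could be sketched in a complete generator as follows. This is a Python 3 sketch under a few assumptions, not the original poster's code: the name get_chunks and the n_chunks parameter are made up for illustration (n_chunks stands in for mp.cpu_count()), the file is opened in binary mode so that offsets are plain bytes, and the comparison is "at or past the end" because a relative seek can land beyond the last byte of the file.

```python
import os


def get_chunks(path, n_chunks):
    """Yield (start, length) byte ranges that each end at a line boundary."""
    size = os.path.getsize(path)
    # Guard against a zero step for files smaller than n_chunks bytes.
    chunksize = max(1, size // n_chunks)

    # Binary mode keeps seek() and tell() in plain byte offsets.
    with open(path, "rb") as f:
        while True:
            start = f.tell()
            if start >= size:
                break  # at (or past) the end of the file: nothing left
            f.seek(chunksize, os.SEEK_CUR)  # jump roughly one chunk ahead
            f.readline()                    # then finish the current line
            end = min(f.tell(), size)       # the seek may have gone past EOF
            yield start, end - start
```

Because every chunk runs on to the next newline, fewer than n_chunks ranges may come back. The ranges only line up with record boundaries when no record contains an embedded newline.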

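That should fix it. I wouldn't rely on a CSV file having exactly one record per line, though. I have worked with several CSV files whose strings contain newlines: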

first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...
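
A quick way to see how the csv module treats data like this is to compare the number of physical lines with the number of parsed rows. The snippet below is only an illustrative Python 3 sketch: the sample is held in a string called SAMPLE (a name chosen here) instead of being read from a file.

```python
import csv
import io

# The sample data from above, held in memory for the demonstration.
SAMPLE = (
    "first,last,message\n"
    "sue,ee,hello\n"
    'bob,builder,"hello,\n'
    "this is some text\n"
    'that I entered"\n'
    "jim,bob,I'm not so creative...\n"
)

physical_lines = SAMPLE.splitlines()
# newline="" leaves embedded line breaks untouched for the csv module.
records = list(csv.reader(io.StringIO(SAMPLE, newline="")))

print(len(physical_lines))  # 6 physical lines
print(len(records))         # 4 rows: the header plus 3 records
print(records[2])           # the quoted field keeps its line breaks
```

Any splitting scheme that cuts on line endings alone could slice a quoted field like this one in half.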

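Note that the 2nd record (bob) spans 3 lines. csv.reader can handle this. If the idea is to do some CPU-intensive work on the CSV, I would create an array of threads, each with a buffer of n records, and have the csv.reader pass a record to each thread in round-robin fashion, skipping a thread if its buffer is full.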
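
The round-robin idea could be sketched roughly as below. Everything here is an assumption made for illustration: the names worker and dispatch, the default sizes, the use of queue.Queue as the per-thread buffer, and the placeholder "work" of counting fields. When every buffer is full, this sketch simply waits on the thread whose turn it is.

```python
import csv
import io
import queue
import threading


def worker(buf, results):
    """Consume records from one buffer until a None sentinel arrives."""
    while True:
        record = buf.get()
        if record is None:
            break
        results.append(len(record))  # placeholder for the real heavy work


def dispatch(records, n_workers=3, buffer_size=4):
    """Hand records to workers in round-robin order, skipping full buffers."""
    buffers = [queue.Queue(maxsize=buffer_size) for _ in range(n_workers)]
    results = []  # list.append is thread-safe in CPython
    threads = [threading.Thread(target=worker, args=(b, results)) for b in buffers]
    for t in threads:
        t.start()

    turn = 0
    for record in records:
        # Try each worker once, starting with the one whose turn it is.
        for offset in range(n_workers):
            i = (turn + offset) % n_workers
            try:
                buffers[i].put_nowait(record)
                break
            except queue.Full:
                continue  # this buffer is full, skip to the next worker
        else:
            i = turn
            buffers[i].put(record)  # all buffers full: wait for this one
        turn = (i + 1) % n_workers

    for b in buffers:
        b.put(None)  # tell each worker to stop
    for t in threads:
        t.join()
    return results


sample = "a,b\nc,d,e\nf\n"
field_counts = dispatch(csv.reader(io.StringIO(sample)))
print(sorted(field_counts))
```

In CPython, threads share the interpreter lock, so truly CPU-bound work may need the same pattern built on processes instead.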
Hope this helps - enjoy.

Python: seeking to EOL in a file not working
