Question

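I'm working with a very large text file (TSV) with about 200 million entries. One of the columns is a date, and the records are sorted by date. Now I want to start reading records from a given date. Currently I just read from the beginning, which is very slow because I have to read almost 100 to 150 million records before I reach that record. I was wondering whether I could use binary search to speed this up, so that I get there with at most 28 extra record reads (log2 of 200 million). Does Python allow reading the nth line without caching or reading the lines before it?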

Answer 1

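If the lines in the file are not fixed-length, you're out of luck: some function has to read the file to find where line n starts. If the lines are fixed-length, you can open the file, call file.seek(line*linesize), and then read from there.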
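To illustrate the fixed-length case, here is a minimal sketch. The function name `read_fixed_width_line` and its parameters are made up for this example, and it assumes every line, newline included, occupies exactly `line_size` bytes.

```python
def read_fixed_width_line(path, line_number, line_size):
    """Return line `line_number` (0-based) from a file whose lines are all
    exactly `line_size` bytes long, newline included (an assumption)."""
    # Binary mode, so that seek offsets are plain byte counts.
    with open(path, "rb") as f:
        f.seek(line_number * line_size)
        return f.read(line_size).decode("utf-8").rstrip("\r\n")
```

The file is opened in binary mode because byte offsets are only well defined there; in text mode, newline translation and multi-byte encodings can make `line_number * line_size` point at the wrong place.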

Answer 2

If the file to read is large and you don't want to read the whole file into memory at once:

with open("file") as fp:
    for i, line in enumerate(fp):
        if i == 25:
            print(line)  # 26th line
        elif i == 29:
            print(line)  # 30th line
        elif i > 29:
            break

Note that i == n-1 for the nth line.
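The same idea can be written more compactly with `itertools.islice`. This is only a sketch; the helper name `read_line` is made up for illustration. It still reads every preceding line from disk, it just never keeps them in memory.

```python
from itertools import islice

def read_line(path, n):
    """Return the nth line (1-based) of the file, or None if the file is shorter."""
    with open(path) as fp:
        # islice lazily skips the first n-1 lines; they are still read from disk.
        return next(islice(fp, n - 1, n), None)
```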

Answer 3

You can use the method fileObject.seek(offset[, whence])

#offset -- This is the position of the read/write pointer within the file.

#whence -- This is optional and defaults to 0, which means absolute file positioning. Other values are 1, which means seek relative to the current position, and 2, which means seek relative to the file's end.


with open("test.txt", "r") as f:
    line_size = 8  # 6 digits plus the line ending (8 bytes assumes "\r\n"; use 7 for "\n")
    line_number = 5
    f.seek(line_number * line_size, 0)
    for i in range(5):
        print(f.readline())

For this code, I used the following file:

Answer 4

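Python has no way to skip "lines" in a file. The best approach I know of is to use a generator that yields lines based on a certain condition, i.e. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and the time spent on I/O.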

Example:

# using python 3.4 syntax (parameter type annotation)

from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            #    the file is tab separated (because .tsv is the extension)
            #    the date column has column-index == 0
            #    the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

# note: datetime(2015, 01, 01) is a syntax error in Python 3 (leading zeros)
my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = list(my_file_gen)

But this still limits you to a single processor :(

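Assuming you're on a Unix-like system and bash is your shell, I would split the file with the shell utility split, then use multiprocessing together with the generator defined above.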

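I don't have a large file to test with right now, but I'll update this answer later with a benchmark comparing plain iteration with the generator against splitting the file and then iterating with the generator and the multiprocessing module.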

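With more knowledge about the file (e.g. if all the desired dates are clustered at the beginning | center | end), you may be able to optimize the read further.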
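One piece of knowledge the question does give is that the records are sorted by date. A possible way to exploit that, sketched below, is to bisect on byte offsets instead of line numbers: seek to the middle of the file, move forward to the next line start, and compare that line's date with the target. This is not part of the original answers. The names `seek_to_date` and `_seek_line_start` are made up, and the sketch assumes the date is the first tab-separated column, is formatted as '%Y-%m-%d' (so plain string comparison orders it correctly), and that the file has no empty lines.

```python
import os

def _seek_line_start(f, pos):
    """Move f to the first line start at or after byte offset pos; return it."""
    if pos == 0:
        f.seek(0)
        return 0
    f.seek(pos - 1)
    f.readline()  # consume everything up to and including the next newline
    return f.tell()

def seek_to_date(f, target):
    """Position binary file f at the first line whose date column is >= target.

    target is a 'YYYY-MM-DD' string. Returns the byte offset of that line,
    or the file size if every line is earlier than target.
    """
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        _seek_line_start(f, mid)
        line = f.readline()
        # Assumption: column 0 holds the date, and ISO dates sort as strings.
        if line and line.split(b"\t", 1)[0].decode("ascii") < target:
            lo = mid + 1   # the wanted line starts after offset mid
        else:
            hi = mid       # the wanted line starts at or before this one
    return _seek_line_start(f, lo)
```

Usage would look like `with open(path, "rb") as f: seek_to_date(f, "2015-01-01")`, followed by `for line in f:` to read from that record onward. The number of probes grows with the logarithm of the file size in bytes, so it is somewhat more than the 28 reads estimated in the question, but still only a few dozen seeks.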

Python: go to a line in a text file without reading the previous lines
