Question

Let me restate my earlier question:

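I want to read only about 26 million lines from a file that has roughly 600 million lines. I already have a list containing the line numbers of the 26 million lines I am interested in.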

My current solution is:

# list_ is a list holding the line numbers of the 26MM rows of interest

# First, open the output file that the 26MM rows will be copied to
with open(output_file, 'w') as g:
  # Open the source file with 600MM rows
  with open(source_file, 'r') as f:
    for i, line in enumerate(f):
      if i in list_:
        g.write(line)

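Given the size of the list and the size of the source file, I am worried that processing the file could take a very long time. I know other questions have covered this topic, but I don't think any of them address the case where the text file is this large.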

Thanks, and apologies for the confusing earlier post.

Answer 1

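If all you need is to check whether a value is in the collection, the best you can do is O(1) per check. `i in list_` on a list is a linear scan, so every one of the 600 million lines pays for a walk over up to 26 million entries. Use a hash set instead of a list. In Python that is the built-in `set`; searching for "python hashset" turns up examples, and the Python documentation on sets covers the details.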
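As a rough sketch of that suggestion, the loop from the question can stay as it is, with only the lookup structure changed. The function name `copy_selected_lines` and its parameters are made up for illustration, and it assumes the indices are 0-based line positions, matching `enumerate` in the question.

```python
def copy_selected_lines(source_path, output_path, indices):
    """Copy the lines whose 0-based position is in `indices`.

    Returns the number of lines written. Hypothetical helper, not part
    of any library.
    """
    # Building the set is a one-time cost; each membership test afterwards
    # is O(1) on average instead of a scan over the whole list.
    wanted = set(indices)
    written = 0
    with open(source_path, "r") as src, open(output_path, "w") as dst:
        for i, line in enumerate(src):
            if i in wanted:
                dst.write(line)
                written += 1
                if written == len(wanted):
                    break  # every requested line has been copied
    return written
```

Called as `copy_selected_lines(source_file, output_file, list_)`, this still reads the file sequentially, so total time is bounded by the single pass over the source file rather than by the lookups. A set of 26 million integers is not free, though: expect it to take on the order of a gigabyte or more of memory in CPython.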
