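In my ML projects I have started running into CSV files larger than 10 GB, so I am trying to implement an efficient way to grab specific rows from a CSV file.
This led me to itertools (which can supposedly skip over a csv.reader's rows efficiently, whereas looping over the reader would load every row it passes into memory), and following a linked reference I tried the following: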
import collections
import csv
import itertools

with open(csv_name, newline='') as f:
    ## Efficiently find total number of lines in csv
    lines = sum(1 for line in f)
    ## Proceed only if my csv has more than just its header
    if lines < 2:
        return None
    else:
        ## Read csv file
        reader = csv.reader(f, delimiter=',')
        ## Skip to last line
        consume(reader, lines)
        ## Output last row
        last_row = list(itertools.islice(reader, None, None))
where consume() is:
def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is none, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)
However, all I get from last_row is an empty list, which means something went wrong.
This is the short CSV I am testing the code on:
Author,Date,Text,Length,Favorites,Retweets
Random_account,2019-03-02 19:14:51,twenty-two,10,0,0
Where am I going wrong?
Answer 0 (score: 1)
What goes wrong is that you iterate over the file to get its length, and that exhausts the file iterator:
lines = sum(1 for line in f)
You either need to re-open the file, or use f.seek(0).
So:
def get_last_line(csv_name):
    with open(csv_name, newline='') as f:
        ## Efficiently find total number of lines in csv
        lines = sum(1 for line in f)  # the iterator is now exhausted
    if lines < 2:  # lines is already an int, so compare it directly
        return
    with open(csv_name, newline='') as f:  # open file again
        # Keep going with your function
        ...
Or alternatively,
def get_last_line(csv_name):
    with open(csv_name, newline='') as f:
        ## Efficiently find total number of lines in csv
        lines = sum(1 for line in f)  # the iterator is now exhausted
        if lines < 2:  # lines is already an int, so compare it directly
            return
        # we can "cheat" the iterator protocol and
        # move the iterator back to the beginning
        f.seek(0)
        ...  # continue with the function
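Putting the f.seek(0) variant together, a minimal sketch of the whole function might look like the following. It assumes every record sits on a single physical line (no quoted newlines inside fields), so that the line count equals the row count, and `get_last_row` is a hypothetical name. Note that it skips `lines - 1` rows rather than `lines`: skipping all of them, as `consume(reader, lines)` does in the question, would leave nothing to read even after rewinding.

```python
import csv
import itertools


def get_last_row(csv_name):
    """Return the last row of a CSV file as a list, or None if it has no data rows."""
    with open(csv_name, newline='') as f:
        # First pass: count physical lines (this exhausts the file iterator).
        lines = sum(1 for _ in f)
        if lines < 2:
            # Only a header (or nothing at all): no data row to return.
            return None
        # Rewind so the csv reader starts from the top of the file again.
        f.seek(0)
        reader = csv.reader(f, delimiter=',')
        # Skip every row except the last one, then read that single row.
        # Assumption: one row per physical line (no embedded newlines).
        return next(itertools.islice(reader, lines - 1, None), None)
```

The file is still read twice (once to count, once to skip), so this mainly illustrates the fix rather than the fastest route to the last row.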
However, if all you want is the last line, you can simply do:
for line in f:
    pass
print(line)
Using collections.deque might be faster (they use it in the recipe):
collections.deque(f, maxlen=1)
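For the CSV case in the question, the same idea can be applied to the reader instead of the raw file, so that the last row comes back already parsed. This is only a sketch: `last_csv_row` is a hypothetical helper name, it does not distinguish a header-only file from one with data, and it still reads the whole file once. What it avoids is keeping more than one row in memory at a time.

```python
import collections
import csv


def last_csv_row(csv_name):
    """Return the last parsed row of a CSV file, or None if the file is empty."""
    with open(csv_name, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        # A deque with maxlen=1 discards everything but the most recent item.
        last = collections.deque(reader, maxlen=1)
    # The deque is empty only when the reader produced no rows at all.
    return last[0] if last else None
```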
Here are two ways to go about it; let me quickly create a file:
Juans-MacBook-Pro:tempdata juan$ history > history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ cat history.txt | wc -l
2000
OK, in IPython:
In [1]: def get_last_line_fl(filename):
   ...:     with open(filename) as f:
   ...:         prev = None
   ...:         for line in f:
   ...:             prev = line
   ...:         if prev is None:
   ...:             return None
   ...:         else:
   ...:             return line
   ...:
In [2]: import collections
   ...: def get_last_line_dq(filename):
   ...:     with open(filename) as f:
   ...:         last_two = collections.deque(f, maxlen=2)
   ...:         if len(last_two) < 2:
   ...:             return
   ...:         else:
   ...:             return last_two[-1]
   ...:
In [3]: %timeit get_last_line_fl('history.txt')
1000 loops, best of 3: 337 µs per loop
In [4]: %timeit get_last_line_dq('history.txt')
1000 loops, best of 3: 339 µs per loop
In [5]: get_last_line_fl('history.txt')
Out[5]: ' 588 history >> history.txt\n'
In [6]: get_last_line_dq('history.txt')
Out[6]: ' 588 history >> history.txt\n'
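To repeat this comparison outside IPython, the standard timeit module can be used. The sketch below is a stand-in under stated assumptions: it generates its own 2000-line temporary file instead of history.txt, it uses simplified variants of the two helpers (the deque one keeps a single line), and `compare` is a hypothetical name. Timings vary by machine, so no expected numbers are shown.

```python
import collections
import os
import tempfile
import timeit


def get_last_line_fl(filename):
    # Plain for-loop: keep overwriting `last` until the file is exhausted.
    last = None
    with open(filename) as f:
        for line in f:
            last = line
    return last


def get_last_line_dq(filename):
    # A deque with maxlen=1 keeps only the final line.
    with open(filename) as f:
        last = collections.deque(f, maxlen=1)
    return last[0] if last else None


def compare(n_lines=2000, repeats=200):
    """Time both helpers on a throwaway file and return (loop_s, deque_s)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w") as f:
            # Assumed sample content, loosely modelled on the history output.
            f.writelines(f"{i} history >> history.txt\n" for i in range(n_lines))
        # Both helpers should agree before their timings are compared.
        assert get_last_line_fl(path) == get_last_line_dq(path)
        t_loop = timeit.timeit(lambda: get_last_line_fl(path), number=repeats)
        t_deque = timeit.timeit(lambda: get_last_line_dq(path), number=repeats)
    return t_loop, t_deque


if __name__ == "__main__":
    loop_s, deque_s = compare()
    print(f"for-loop: {loop_s:.4f}s  deque: {deque_s:.4f}s")
```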