Question

import csv
from itertools import islice
import numpy as np

f = open("data.csv")
f.seek(0)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)

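The above is the code I use to read a csv file. The csv file is only about 800 MB, and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million rows. However, never mind reading the entire file: even reading just the first 10 million rows gives me a 'MemoryError:' <- this really is the entire error message.

Could someone tell me why, please? Also, as a side question, could someone tell me how to read starting from, say, the 20 millionth row? I know I need to use f.seek(some number), but since my data is a csv file, I don't know which number exactly I should put into f.seek() so that it reads from exactly the 20 millionth row.

Thank you very much.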

Answer 1

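Could someone tell me how to read starting from, say, the 20 millionth row? I know I need to use f.seek(some number)

No, you can't (and mustn't) use f.seek() in this situation. Rather, you must read each of the first 20 million rows somehow.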
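
The underlying reason is that CSV rows are not fixed-width, so the byte offset of a given row cannot be computed without scanning everything before it. A minimal sketch with made-up in-memory data (the `sample` bytes are purely illustrative, not taken from the asker's file):

```python
import io

# Hypothetical sample data: three rows with different byte lengths.
sample = b"1,2\n100,200\n7,8\n"

offsets = []
pos = 0
for line in io.BytesIO(sample):
    offsets.append(pos)  # byte offset at which this row starts
    pos += len(line)

print(offsets)  # [0, 4, 12]
```

The gaps between offsets differ from row to row, so no single multiplication gives the number to pass to f.seek().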

The Python documentation has this recipe:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
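
To see what the two branches of the recipe do, here is a small sketch on toy iterators. The variable names are illustrative only, and the idioms are written out inline rather than through consume() so the snippet runs on its own:

```python
from collections import deque
from itertools import islice

it = iter(range(10))
# Empty slice starting at position 3: consumes 0, 1, 2 and yields nothing.
skipped = next(islice(it, 3, 3), None)
first_after_skip = next(it)

rest = iter("abc")
deque(rest, maxlen=0)  # drains the iterator without storing any item
leftover = next(rest, "exhausted")

print(skipped, first_after_skip, leftover)  # None 3 exhausted
```

Neither idiom keeps the skipped items around, which is what makes the recipe suitable for stepping over millions of rows.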

Using it, you would start after 20,000,000 rows like this:

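# UNTESTED
import csv
from itertools import islice
import numpy as np

with open("data.csv", newline="") as f:
    f_reader = csv.reader(f)
    consume(f_reader, 20000000)  # skip the first 20 million parsed rows
    raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)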

Or perhaps this might go faster:

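# UNTESTED
import csv
from itertools import islice
import numpy as np

with open("data.csv", newline="") as f:
    consume(f, 20000000)  # skip raw lines before any CSV parsing happens
    f_reader = csv.reader(f)
    raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)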
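
Since both variants are marked untested, the two skipping strategies can be compared on a tiny in-memory stand-in for data.csv. This is only a sketch: the `text` data and the helper names `rows_via_reader` and `rows_via_file` are hypothetical, and the NumPy conversion is left out to keep it standard-library only.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for data.csv: 10 rows of "i,i*i".
text = "".join(f"{i},{i * i}\n" for i in range(10))

def rows_via_reader(skip, count):
    # Skip parsed rows: every skipped line still goes through the CSV parser.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[int(x) for x in row] for row in islice(reader, skip, skip + count)]

def rows_via_file(skip, count):
    # Skip raw lines first, then parse only the rows that are kept.
    f = io.StringIO(text, newline="")
    next(islice(f, skip, skip), None)
    return [[int(x) for x in row] for row in islice(csv.reader(f), count)]

print(rows_via_reader(4, 3))  # [[4, 16], [5, 25], [6, 36]]
print(rows_via_file(4, 3))    # [[4, 16], [5, 25], [6, 36]]
```

The two agree here because each row sits on exactly one physical line. If a quoted field could contain a newline, skipping raw lines would no longer line up with skipping rows.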

64-bit system, 8 GB RAM, a CSV of just over 800 MB, and reading it with Python causes a memory error
