Question

Using Python's readlines() function, I can retrieve a list of every line in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

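I am working on a problem that involves very large files, and this approach produces a memory error. Is there a pandas equivalent of Python's readlines() function? The chunksize option of pd.read_csv() seems to add numbers to my lines, which is far from ideal.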

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize = 100):
   ...:     lines.append(df)
In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line

Answer 1

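You should try using the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This forces pd.read_csv() to read a fixed number of lines at a time instead of reading the entire file in one go. It would look like this:

>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

In the above example, the file is read line by line.

Now, in fact, according to the documentation of pandas.read_csv, what is returned here is not a pandas.DataFrame object but a TextFileReader object.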

chunksize : int, default None

Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
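To make the point about the return type concrete, here is a small sketch that uses an in-memory copy of the question's three-line file (the `sample` buffer is an assumption standing in for a real path). It only shows that passing chunksize yields an iterable reader whose items are DataFrames, one per chunk.

```python
import io

import pandas as pd

# In-memory stand-in for the three-line s.csv from the question (assumption).
sample = io.StringIO(
    "hello here is a line\n"
    "here is another line\n"
    "here is my last line\n"
)

# With chunksize set, read_csv returns a reader object, not a DataFrame.
reader = pd.read_csv(sample, chunksize=1, header=None)
reader_type = type(reader).__name__

# Iterating the reader produces one DataFrame per chunk.
chunks = list(reader)
print(reader_type, len(chunks))
```

Each chunk is still a DataFrame with its own index, which is where the numbers shown in the question's output come from; header=None keeps the first line from being consumed as column names.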

So, to complete the exercise, you need to iterate over the TextFileReader in a loop and handle one chunk at a time.
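A minimal sketch of such a loop, offered as an illustration rather than the answer's original code. The helper name `iter_lines` is hypothetical, and the sketch assumes that no line contains a comma or quote character, so that read_csv parses each line as a single column. The dtype=str and na_filter=False arguments are additions here so that values come back as plain strings.

```python
import pandas as pd


def iter_lines(path, chunksize=1):
    """Yield the lines of a text file one at a time via read_csv chunks.

    Hypothetical helper: assumes every line parses as a single column
    (no commas or quote characters in the text).
    """
    reader = pd.read_csv(
        path,
        chunksize=chunksize,  # number of lines held in memory per chunk
        header=None,          # do not treat the first line as column names
        dtype=str,            # keep values as text, e.g. "123" stays a string
        na_filter=False,      # do not convert strings like "NA" to NaN
        encoding="utf-8",
    )
    for chunk in reader:
        # Each chunk is a DataFrame; column 0 holds the line text.
        for line in chunk.iloc[:, 0].tolist():
            yield line
```

Calling `for line in iter_lines('s.csv', chunksize=100): ...` processes the file without building the full list. Unlike readlines(), the yielded strings carry no trailing newline, and blank lines are skipped under read_csv's default settings.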

I hope this helps!

Pandas equivalent of Python's readlines function
