Pandas read_csv is missing rows

Time: 2020-05-12 15:58:11

Tags: python pandas csv

I have a large CSV file containing 16 million rows, as shown below:

    with open(r'file.csv') as fp:
        count = 0
        for _ in fp:
            count += 1
    print(count)
    16817381

However, when I read it with pandas.read_csv, I only see 15M+ rows:

    df = pd.read_csv(r'file.csv', low_memory = False, usecols = [0, 13, 4, 5, 6, 7, 8, 11])
    df.shape[0]
    15234809

The file is poorly formatted. It has 27 columns in total, but some rows have values in additional columns. I suspect this is what causes the problem.

For example, if I do not specify anything in usecols, I see the following error:

    Error tokenizing data. C error: Expected 27 fields in line 189, saw 28

I checked similar questions and tried adding extra parameters, but nothing worked.

Can anyone advise? Thanks!

1 answer:

Answer 0: (score: 1)

Try something like this:

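import csv

import pandas as pd


def ReadRows(stream, max_length=None):
    # Parse the stream into rows (lists of string fields).
    rows = csv.reader(stream)
    # If no width is given, load every row into memory to find the widest one.
    if max_length is None:
        rows = list(rows)
        max_length = max((len(row) for row in rows), default=0)
    # Pad short rows with None so that every row has max_length fields.
    for row in rows:
        yield row + [None] * (max_length - len(row))


# newline='' lets the csv module handle line breaks inside quoted fields.
with open('yourFile.csv', newline='') as f:
    df = pd.DataFrame.from_records(list(ReadRows(f)))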
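
Before padding, it may help to check how irregular the file really is. The following is a minimal sketch, not a definitive diagnosis: the helper names field_count_histogram and iter_padded_rows are hypothetical, and it assumes the file is comma-separated with standard double-quote quoting. It counts how many parsed rows have each number of fields, then pads rows in a second pass so the whole file does not have to be held as a list just to find the widest row.

```python
import csv
from collections import Counter


def field_count_histogram(path):
    """Map number-of-fields -> number of parsed rows (hypothetical helper)."""
    with open(path, newline='') as f:
        # csv.reader yields [] for a blank line, so blank lines show up under key 0.
        return Counter(len(row) for row in csv.reader(f))


def iter_padded_rows(path, width):
    """Yield each parsed row, padded with None up to `width` fields."""
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row + [None] * (width - len(row))
```

If the histogram shows a key of 0, the file contains blank lines, which pandas.read_csv skips by default. If the total number of parsed rows is lower than the raw line count, some quoted fields probably span several physical lines. Either effect could account for part of the gap between the two counts in the question. The widest row is max(counts), and the padded rows can then be passed to pd.DataFrame.from_records as in the answer; the resulting DataFrame still has to fit in memory.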