I have a very large CSV file with more than 16 million lines, as shown below:
with open(r'file.csv') as fp:
    count = 0
    for _ in fp:
        count += 1
print(count)
16817381
However, when I read it with pandas.read_csv, I only get 15M+ rows:
df = pd.read_csv(r'file.csv', low_memory = False, usecols = [0, 13, 4, 5, 6, 7, 8, 11])
df.shape[0]
15234809
The file is poorly formatted. It should have 27 columns, but some rows have values in additional columns, and I suspect that is what causes the problem.
For example, if I don't pass usecols at all, I get the following error:
Error tokenizing data. C error: Expected 27 fields in line 189, saw 28
I checked similar questions and tried adding extra parameters suggested there, but nothing worked.
Can anyone advise? Thanks!
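One way to test the suspicion about extra columns is to tally how many fields each record has before involving pandas at all. The following is only a minimal sketch: the helper name field_count_histogram and the in-memory sample data are hypothetical, and Python's csv module may not split records exactly as the pandas C parser does.

```python
import csv
import io
from collections import Counter

def field_count_histogram(stream):
    """Return a Counter mapping number-of-fields to number-of-records."""
    return Counter(len(row) for row in csv.reader(stream))

# Tiny in-memory stand-in for the real file (hypothetical data).
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9\n")
histogram = field_count_histogram(sample)
print(histogram)
```

On the real file, the same function could be called on a handle opened with open(r'file.csv', newline=''). If the total number of records is lower than the raw line count, some quoted fields probably contain line breaks, since a single record can then span several physical lines.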
Answer 0 (score: 1)
Try something like this:
import csv
import pandas as pd

def ReadRows(stream, max_length=None):
    # Parse the stream into rows with the csv module.
    rows = csv.reader(stream)
    # If no width is given, load every row to find the longest one.
    if max_length is None:
        rows = list(rows)
        max_length = max(len(row) for row in rows)
    # Pad short rows with None so every row has max_length fields.
    for row in rows:
        yield row + [None] * (max_length - len(row))

with open('yourFile.csv', newline='') as f:
    df = pd.DataFrame.from_records(list(ReadRows(f)))
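With around 16 million rows, holding every parsed row in a list may use a lot of memory. A possible variation, sketched here under the assumption that the intended width (27 in the question) is known in advance, pads or truncates each row lazily and hands rows out in fixed-size chunks. The names normalized_rows and chunked are hypothetical, and truncation discards the extra values, which may or may not be acceptable for the data.

```python
import csv
import io
from itertools import islice

def normalized_rows(stream, width):
    """Yield each CSV record padded or truncated to exactly `width` fields."""
    for row in csv.reader(stream):
        # A negative multiplier yields an empty list, so long rows are not padded.
        padded = row + [None] * (width - len(row))
        yield padded[:width]

def chunked(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from an iterable of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

# Tiny in-memory stand-in for the real file (hypothetical data).
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9\n")
chunks = list(chunked(normalized_rows(sample, width=3), chunk_size=2))
print(chunks)
```

Each chunk can then be passed to pd.DataFrame.from_records and processed or written out before the next one is read, so only chunk_size rows are held as Python lists at any time.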