I have a large gzip file that I want to import into a pandas DataFrame. Unfortunately, the file has an uneven number of columns. The data is roughly in the following format:
.... Col_20: 25 Col_21: 23432 Col22: 639142
.... Col_20: 25 Col_22: 25134 Col23: 243344
.... Col_21: 75 Col_23: 79876 Col25: 634534 Col22: 5 Col24: 73453
.... Col_20: 25 Col_21: 32425 Col23: 989423
.... Col_20: 25 Col_21: 23424 Col22: 342421 Col23: 7 Col24: 13424 Col 25: 67
.... Col_20: 95 Col_21: 32121 Col25: 111231
As a test, I tried this:
import pandas as pd
filename = 'path/to/filename.gz'
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)
This fails with an error.
How do I allocate a fixed number of columns for pandas.read_csv()?
Answer 0 (score: 1)
You can also try this:
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python', error_bad_lines=False):
    print(chunk)
error_bad_lines=False skips the malformed lines. I will see whether I can find a better alternative.
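In pandas 1.3 and later the same behaviour is requested with on_bad_lines='skip'; error_bad_lines was deprecated in 1.3 and removed in 2.0. The snippet below is only a small sketch of that spelling. The in-memory sample is made up for illustration and is not the asker's file.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample: the third line has one field too many.
sample = "a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n"

# on_bad_lines="skip" drops rows that have more fields than expected
# instead of raising a parser error (requires pandas >= 1.3).
df = pd.read_csv(io.StringIO(sample), sep="\t", on_bad_lines="skip")
print(df)
```

As with error_bad_lines=False, the skipped rows are silently lost, so this is only suitable when dropping them is acceptable.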
Edit: to keep track of the lines that error_bad_lines skips, we can inspect each error and add the skipped rows back to the DataFrame:
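import pandas as pd
line = []      # 0-based row numbers to skip on the next attempt
expected = []  # field count the parser expected for each bad row
saw = []       # field count actually found in each bad row
cont = True
while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        message = str(e)  # Python 3 exceptions have no .message attribute
        errortype = message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            # e.g. "... C error: Expected 3 fields in line 4, saw 5"
            cerror = message.split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if n.isdigit()]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
            cont = False  # stop instead of retrying the same failure forever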
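The loop above removes ragged rows one failure at a time. Another option, closer to what the question asks for, is to give read_csv a fixed list of column names up front, so that shorter rows are padded with NaN instead of raising. The following is a minimal sketch under a few assumptions: the fields are tab-separated, the file has no header row, and max_field_count and read_ragged are hypothetical helper names introduced here. The sample rows are made up in the style of the question.

```python
import io

import pandas as pd


def max_field_count(lines, sep="\t"):
    """Return the largest number of sep-delimited fields on any line."""
    return max((line.rstrip("\n").count(sep) + 1 for line in lines), default=0)


def read_ragged(source, n_cols, sep="\t"):
    """Read a headerless file, reserving n_cols columns for every row.

    Rows with fewer than n_cols fields are padded with NaN.
    """
    return pd.read_csv(source, sep=sep, header=None, names=range(n_cols))


# Hypothetical sample with 3, 2 and 4 fields per row.
sample = (
    "Col_20: 25\tCol_21: 23432\tCol22: 639142\n"
    "Col_20: 95\tCol_21: 32121\n"
    "Col_21: 75\tCol_23: 79876\tCol25: 634534\tCol22: 5\n"
)

n_cols = max_field_count(io.StringIO(sample))   # first pass: widest row
df = read_ragged(io.StringIO(sample), n_cols)   # second pass: parse
print(df)
```

For a real .gz file, the first pass could iterate over gzip.open(filename, 'rt'), and read_csv can be given the path directly because it infers gzip compression from the .gz extension. The extra pass costs one more read of the file but keeps every row.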