ValueError: Import data via chunks into pandas.csv_reader()

Time: 2016-09-08 12:57:21

Tags: python pandas chunking

I have a large gzip file that I want to import into a pandas DataFrame. Unfortunately, the file has an uneven number of columns. The data looks roughly like this:

.... Col_20: 25    Col_21: 23432    Col22: 639142
.... Col_20: 25    Col_22: 25134    Col23: 243344
.... Col_21: 75    Col_23: 79876    Col25: 634534    Col22: 5    Col24: 73453
.... Col_20: 25    Col_21: 32425    Col23: 989423
.... Col_20: 25    Col_21: 23424    Col22: 342421    Col23: 7    Col24: 13424    Col 25: 67
.... Col_20: 95    Col_21: 32121    Col25: 111231

As a test, I tried this:

import pandas as pd
filename = 'path/to/filename.gz'

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)

This fails with a ValueError.

How can I assign a fixed number of columns to pandas.read_csv()?

1 answer:

Answer 0: (score: 1)

You can also try this:

# error_bad_lines was removed in pandas 2.0; from pandas 1.3 on, pass on_bad_lines='skip' instead
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python', error_bad_lines=False):
    print(chunk)

error_bad_lines skips the malformed lines. I'll see whether I can find a better alternative.
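One possible alternative to skipping rows, as a minimal sketch: if an upper bound on the number of fields per row is known, passing explicit column names through the names parameter lets read_csv pad shorter rows with NaN instead of failing. The sample data and the max_cols value below are made up for illustration, and the in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical sample: tab-separated rows with 3, 2 and 4 fields.
sample = "a\tb\tc\nd\te\nf\tg\th\ti\n"

# Assumption: no row in the file has more than max_cols fields.
max_cols = 4

frames = []
# header=None plus explicit names fixes the column count up front,
# so rows with fewer fields are padded with NaN.
for chunk in pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                         names=list(range(max_cols)), chunksize=2):
    frames.append(chunk)

df = pd.concat(frames, ignore_index=True)
print(df)
```

This only works when max_cols is at least as large as the widest row, so for a real file it would have to be determined first, for example with one pass over the lines that counts the separators.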

Edit: To keep the lines that error_bad_lines would skip, we can inspect the error and add those lines back to the DataFrame.

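import pandas as pd

line     = []   # 0-based numbers of the rows to skip on the next attempt
expected = []   # number of fields the parser expected
saw      = []   # number of fields actually found on the bad row
cont     = True

while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        message = str(e)  # e.message only exists in Python 2
        errortype = message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            # Message looks like: "... C error: Expected 3 fields in line 4, saw 5"
            cerror = message.split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if n.isdigit()]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
            cont = False  # stop here; retrying would raise the same error forever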