Need to skip rows containing a "ValueError"

Time: 2016-02-02 19:34:53

Tags: python-3.x numpy pandas teradata pyodbc

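I'm trying to extract some legacy data from a Teradata server, but some records contain odd characters that Python doesn't recognize, such as "U+ffffffc2".

Currently,

  1. I'm using pyodbc to pull the data from Teradata

  2. I put the results into a numpy array (because when I put them straight into pandas, it interprets all the columns as a single column of type string)

  3. I then convert the numpy array into a pandas DataFrame, to change things like Decimal("09809") and Date("2015,11,14") into [09809, "11,14,2015"]

  4. I then try to write it to a file, which is where this error occurs

    ValueError: character U+ffffffc2 is not in range [U+0000; U+10FFFF]

  5. I don't have permission to edit this data, so from the client side, what can I do to skip the character or, preferably, remove it before writing, so that writing to the file no longer raises the error?

    Currently, I have a "try and except" block to skip queries with bad data, but I have to query the data in batches of at least 100 rows. So if I just skip, I lose 100 or more rows at a time. As I mentioned, though, I would rather keep the row and just remove the character.

    Here is my code. (Feel free to point out any bad practices!)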

     #Python 3.4
     #Python Teradata Extraction
     #Created 01/28/16 by Maz Baig
    
     #dependencies
     import pyodbc
     import numpy as np
     import pandas as pd
     import sys
     import os
     import psutil
     from datetime import datetime
    
    
     #create a global variable for start time
     start_time=datetime.now()
     #create global process variable to keep track of memory usage
     process=psutil.Process(os.getpid())
    
     def ResultIter(curs, arraysize):
             #Get the specified number of rows at a time
             while True:
                     results = curs.fetchmany(arraysize)
                     if not results:
                             break
                     #for result in results:
                     yield results
    
     def WriteResult(curs,file_path,full_count):
             rate=100
             rows_extracted=0
             for result in ResultIter(curs,rate):
                     table_matrix=np.array(result)
                     #Get shape to make sure its not a 1d matrix
                     rows, length = table_matrix.shape
                     #if it is a 1D matrix, add a row of nothing to make sure pandas doesn't throw an error
                     if rows < 2:
                             dummyrow=np.zeros((1,length))
                             dummyrow[:]=None
                     df = pd.DataFrame(table_matrix)
                     #give the user a status update
                     rows_extracted=rows+rows_extracted
                     StatusUpdate(rows_extracted,full_count)
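                     #open in append mode and write to the open handle, so each batch is added to the file instead of overwriting it
                     with open(file_path,'a',encoding='latin-1') as f:
                             try:
                                     df.to_csv(f,sep='\u0001',header=False,index=False)
                             except ValueError:
                                     #report the batch that failed and move on to the next one
                                     print("This record was giving you issues")
                                     print(table_matrix)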
             print('\n')
             if (rows_extracted < full_count):
                     print("All of the records were not extracted")
                     #print the run duration
                     print("Duration:  "+str(datetime.now() - start_time))
                     sys.exit(3)
    
    
    
    
     def StatusUpdate(rows_ex,full_count):
             print("                                      ::Rows Extracted:"+str(rows_ex)+" of "+str(full_count)+"    |    Memory Usage: "+str(process.memory_info().rss/78
    
    
    
     def main(args):
             #get Username and Password
             usr = args[1]
             pwd = args[2]
             #Define Table
             view_name=args[3]
             table_name=args[4]
             run_date=args[5]
             #get the select statement as an input
             select_statement=args[6]
             if select_statement=='':
                     select_statement='*'
             #create the output filename from tablename and run date
             file_name=run_date + "_" + table_name +"_hist.dat"
             file_path="/prod/data/cohl/rfnry/cohl_mort_loan_perfnc/temp/"+file_name
             if ( not os.path.exists(file_path)):
                     #create connection
                     print("Logging In")
                     con_str = 'DRIVER={Teradata};DBCNAME=oneview;UID='+usr+';PWD='+pwd+';QUIETMODE=YES;'
                     conn = pyodbc.connect(con_str)
                     print("Logged In")
    
                     #Get number of records in the file
                     count_query = 'select count (*) from '+view_name+'.'+table_name
                     count_curs = conn.cursor()
                     count_curs.execute(count_query)
                     full_count = count_curs.fetchone()[0]
    
                     #Generate query to retrieve all of the table data
                     query = 'select '+select_statement+'  from '+view_name+'.'+table_name
                     #create cursor
                     curs = conn.cursor()
                     #execute query
                     curs.execute(query)
                     #save contents of the query into a matrix
                     print("Writing Result Into File Now")
                     WriteResult(curs,file_path,full_count)
                     print("Table: "+table_name+" was successfully extracted")
                     #print the scripts run duration
                     print("Duration:  "+str(datetime.now() - start_time))
                     sys.exit(0)
             else:
                     print("AlreadyThere Exception\nThe file already exists at "+file_path+". Please remove it before continuing\n")
                     #print the scripts run duration
                     print("Duration:  "+str(datetime.now() - start_time))
                     sys.exit(2)
    
     main(sys.argv)
    

    Thanks,

    Maz

1 Answer:

Answer 0 (score: 2)

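If it is only the 4-byte Unicode code points that raise the error, this may help. One solution is to use codecs.register_error to register a custom error handler that filters out the offending code points, and then try decoding: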

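    import codecs

    def error_handler(error):
        # Replace the offending byte with nothing and resume 6 bytes past the
        # end of the reported error, skipping the literal "ffffc2" that follows
        return '', error.end+6

    # Register the handler under a name that can be passed as errors=...
    codecs.register_error('nonunicode', error_handler)

    b'abc\xffffffc2def'.decode(errors='nonunicode')
    # gives you 'abcdef', which is exactly what you want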
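
The handler above hard-codes a six-byte skip, which fits this one example. A variant that skips only the span the codec itself reports as invalid, and keeps a record of what it removed, might look like the sketch below. The handler name `drop_invalid` and the `dropped` list are hypothetical names chosen for illustration, and I haven't confirmed that the ValueError in the question is raised by a codec at all, so treat this as a demonstration of the mechanism only:

```python
import codecs

# Record of every span the handler removed, for later inspection
dropped = []

def drop_invalid(error):
    """Skip exactly the span the codec reports as invalid and remember it."""
    if isinstance(error, (UnicodeDecodeError, UnicodeEncodeError)):
        # error.object is the input; start/end delimit the offending span
        dropped.append(error.object[error.start:error.end])
        # Substitute nothing and resume right after the bad span
        return '', error.end
    # Anything else is not ours to handle
    raise error

codecs.register_error('drop_invalid', drop_invalid)

# Decoding: the lone 0xC2 lead byte is invalid UTF-8 and gets removed
cleaned = b'abc\xc2def'.decode('utf-8', errors='drop_invalid')

# Encoding: the euro sign has no latin-1 representation and gets removed
encoded = 'abc\u20acdef'.encode('latin-1', errors='drop_invalid')
```

Without the record-keeping, this behaves like the built-in `errors='ignore'`; the custom handler is only worth it if you want to log or inspect what was stripped.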

You can push your handler further to catch more complex errors; for details, see https://docs.python.org/3/library/exceptions.html#UnicodeError and https://docs.python.org/3/library/codecs.html#codecs.register_error
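
Separately from decoding, the question's concern about losing a whole 100-row batch can be addressed by catching the error per row instead of per batch. Here is a minimal sketch, assuming each row can be serialized on its own. `write_rows` and its parameters are hypothetical names, and the standard `csv` module stands in for `DataFrame.to_csv`, so this is not a drop-in replacement for the question's `WriteResult`:

```python
import csv
import io

def write_rows(rows, fh, delimiter='\x01'):
    """Write rows one at a time; return the rows that could not be written."""
    skipped = []
    for row in rows:
        try:
            # Format the full line in memory first, so a failure part-way
            # through a row never leaves a partial line in the output
            buf = io.StringIO()
            csv.writer(buf, delimiter=delimiter, lineterminator='\n').writerow(row)
            fh.write(buf.getvalue())
        except ValueError:
            # UnicodeError subclasses ValueError, so encoding failures land here too
            skipped.append(row)
    return skipped

# Example with an in-memory file standing in for the real output file
out = io.StringIO()
bad_rows = write_rows([['a', 1], ['b', 2]], out)
```

With this shape, one bad record costs you one row rather than the whole fetch, and the returned list tells you which rows to investigate afterwards.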