Question

I am using the Python script below to import a CSV file placed on the server into a PostgreSQL table.

However, I am getting the following error.

Error while fetching data from PostgreSQL COPY from stdin failed: error in .read() call: UnicodeDecodeError 'utf-8' codec can't decode byte 0xdf in position 1237: invalid continuation byte

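The file command reports the CSV file as "ufl.csv: ISO-8859 text, with very long lines", and my server uses UTF encoding. Can anyone suggest how to modify the script below so that I do not have to explicitly convert the CSV file to UTF encoding? Can this be done in code?

If I convert the CSV file's encoding to UTF, the code below works fine.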

import csv
import psycopg2
import time
import os
from datetime import datetime
import shutil

# File path.
filePath='''/Users/linu/Downloads/ufl.csv'''
dirName = '/Users/linu/Downloads/ufl_old_files/'

try:
  conn = psycopg2.connect(host="localhost", database="postgres", user="postgres", password="postgres", port="5432")

  print('DB connected')

except (Exception, psycopg2.Error) as error:
        # Confirm unsuccessful connection and stop program execution.
        print ("Error while fetching data from PostgreSQL", error)
        print("Database connection unsuccessful.")
        quit()

# Check if the CSV file exists.
if os.path.isfile(filePath):
 try:
     print('Entered loop')   
     sql = "COPY %s FROM STDIN WITH DELIMITER AS ';'  csv header"
     file = open('/Users/linu/Downloads/ufl.csv', "r")
     table = 'staging.ufl_tracking_details'

     with conn.cursor() as cur:
        cur.execute("truncate " + table + ";")
        print('truncated the table')
        cur.copy_expert(sql=sql % table, file=file)
        print('Data loaded')
        conn.commit()
        cur.close()
        conn.close()

 except (Exception, psycopg2.Error) as error:
        print ("Error while fetching data from PostgreSQL", error)
        print("Error adding  information.")
        quit()

 if not os.path.exists(dirName):
    os.mkdir(dirName)
    print("Directory " , dirName ,  " Created ")
 else:    
    print("Directory " , dirName ,  " already exists")
 tstamp = os.path.getmtime(filePath)
 timestamp_name=str(time.time())
 os.rename(filePath,dirName + timestamp_name+'.csv')

else:
    # Message stating CSV file could not be located.
    print("Could not locate the CSV file.")
    quit()

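I went through several posts and used the "copy_expert" method mentioned in a few of them, and I also tried some other solutions, but none of them worked. Any hints or suggestions would be a great help.

Note: The requirement is to load the CSV file and, once the load is complete, move the loaded CSV file to a folder and rename it to name + timestamp.

Thanks in advance.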

Answer 1

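The UnicodeDecodeError raised from the cursor indicates an encoding mismatch. Apparently the file contains at least one German sharp s (ß). In Latin-1 (ISO-8859-1) and some other encodings such as Cp1252, it is encoded as 0xdf, whereas in UTF-8 it is encoded as 0xc3 0x9f, so the UTF-8 codec cannot decode the Latin-1 encoded character.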

print(b'\xc3\x9f-'.decode("utf-8"))
# ß-
print(b'\xdf-'.decode("utf-8"))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 0: invalid continuation byte

Note: The hyphen (-) was added to force the invalid continuation byte error. Without it, the second print would raise a UnicodeDecodeError with the reason unexpected end of data.
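The two failure modes described in the note can be compared side by side. This is only a small sketch; the helper name `utf8_decode_reason` is made up for illustration and is not part of any library.

```python
def utf8_decode_reason(data: bytes) -> str:
    """Return the codec's reason for failing to decode data as UTF-8, or "ok"."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.reason holds the short explanation shown in the traceback.
        return exc.reason
    return "ok"


print(utf8_decode_reason(b"\xdf-"))      # 0xdf followed by a byte that cannot continue it
print(utf8_decode_reason(b"\xdf"))       # 0xdf as the very last byte of the input
print(utf8_decode_reason(b"\xc3\x9f-"))  # valid UTF-8 encoding of the sharp s plus hyphen
```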

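These errors occur because the single-byte range of UTF-8 ends at 0x7f, and 0xdf lies in the range of lead bytes for two-byte encoded characters. UTF-8 therefore expects one more byte in order to decode a character in this range.
See also this Q&A.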
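To make the byte ranges concrete, the sketch below classifies a single byte by the role it can play in UTF-8. The function `utf8_byte_kind` is a hypothetical helper written for this illustration; the ranges follow the UTF-8 definition, where 0xc0, 0xc1 and 0xf5 to 0xff never appear in valid data.

```python
def utf8_byte_kind(byte: int) -> str:
    """Describe the role a single byte value can play in a UTF-8 stream."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value between 0 and 255")
    if byte <= 0x7F:
        return "single-byte character (ASCII)"
    if byte <= 0xBF:
        return "continuation byte"
    if byte in (0xC0, 0xC1):
        return "invalid in UTF-8"
    if byte <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if byte <= 0xEF:
        return "lead byte of a 3-byte sequence"
    if byte <= 0xF4:
        return "lead byte of a 4-byte sequence"
    return "invalid in UTF-8"


# "\u00df" is the sharp s; its UTF-8 form is a lead byte plus one continuation byte.
for value in "\u00df".encode("utf-8"):
    print(hex(value), utf8_byte_kind(value))

# The Latin-1 byte 0xdf looks like a lead byte, and "-" (0x2d) cannot continue it.
print(hex(0xDF), utf8_byte_kind(0xDF))
print(hex(0x2D), utf8_byte_kind(0x2D))
```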

If you do not supply an encoding to the open() call, the encoding is determined by locale.getpreferredencoding(False), which apparently returns UTF-8 on your system.
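One way to check this on a given machine is to print the locale's preferred encoding and compare it with what a file object reports. The sketch below uses a throwaway temporary file; the helper names `default_text_encoding` and `encoding_used_by_open` are invented for this example.

```python
import codecs
import locale
import os
import tempfile


def default_text_encoding() -> str:
    """Encoding the locale prefers, which open() falls back to per the text above."""
    return locale.getpreferredencoding(False)


def encoding_used_by_open(path, encoding=None) -> str:
    """Open path in text mode and return the normalized codec name in use."""
    with open(path, "r", encoding=encoding) as handle:
        # codecs.lookup() normalizes aliases such as "latin-1" or "UTF-8".
        return codecs.lookup(handle.encoding).name


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "wb") as handle:
        handle.write(b"name;count\n")
    print("locale default:", default_text_encoding())
    print("open() without encoding:", encoding_used_by_open(sample))
    print("open() with encoding='latin-1':", encoding_used_by_open(sample, "latin-1"))
```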

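You have to pass the file's encoding to the open() call. For example, assuming the file is Latin-1 (ISO-8859-1):

file = open('/Users/linu/Downloads/ufl.csv', "r", encoding="latin-1")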
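Putting this together with the question's script, a minimal sketch could look like the following. It assumes the CSV really is Latin-1 (the "ISO-8859 text" report does not say which ISO-8859 variant, so adjust the name if needed), and `copy_csv_with_encoding` is a hypothetical helper, not a psycopg2 function. It only relies on the cursor's `copy_expert(sql, file)` method that the question already uses.

```python
def copy_csv_with_encoding(cur, csv_path, table, encoding="latin-1"):
    """Stream a semicolon-delimited CSV into table via COPY ... FROM STDIN.

    cur is expected to be an open psycopg2 cursor. The encoding default is
    an assumption about the file and should match its real encoding.
    """
    # Same COPY statement as in the question; the table name is trusted input here.
    sql = "COPY %s FROM STDIN WITH DELIMITER AS ';' csv header" % table
    # Decoding happens here, so the file is read as Latin-1 rather than
    # with the locale default that triggered the UnicodeDecodeError.
    with open(csv_path, "r", encoding=encoding) as handle:
        cur.copy_expert(sql=sql, file=handle)


# Usage inside the question's script (requires a live connection):
#
#     with conn.cursor() as cur:
#         cur.execute("truncate " + table + ";")
#         copy_csv_with_encoding(cur, filePath, table)
#     conn.commit()
```

Opening the file in a with block also closes it before the script later renames it, which the original open() call did not do.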
