Skipping lines in .txt files when importing into PostgreSQL

Time: 2019-04-25 14:43:12

Tags: python python-2.7 psycopg2

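I am trying to import 5000 .txt files into a PostgreSQL database. My script runs fine as long as it does not hit a line that does not fit the expected format. For example, every file ends with a newline, and that also makes the script crash.

I tried to handle the exceptions, but without success...

My script: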

import csv
import os
import sys

import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="demo",
    user="demo",
    password="123",
    port="5432"
)

cur = conn.cursor()

maxInt = sys.maxsize

while True:
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)


def searchFiles(directory='', extension=''):
    print('SEARCHING IN: ', directory)
    filelist = []
    extension = extension.lower()
    for dirpath, dirnames, files in os.walk(directory):
        for name in files:
            if extension and name.lower().endswith(extension):
                filelist.append(os.path.join(dirpath, name))
            elif not extension:
                print('FAILED TO READ: ', (os.path.join(dirpath, name)))
    print('FINISHED FILE SEARCH AND FOUND ', str(len(filelist)), ' FILES')
    return filelist


def importData(fileToImport):
    with open(fileToImport, 'r') as f:
        reader = csv.reader(f, delimiter=':')

        for line in reader:
            try:
                cur.execute("""INSERT INTO demo VALUES (%s, %s)""", (line[0], line[1]))
                conn.commit()
            except:
                pass
                print('FAILED AT LINE: ', line)


print(conn.get_dsn_parameters())
cur.execute("SELECT version();")
record = cur.fetchone()
print("You are connected to - ", record)

fileList = searchFiles('output', '.txt')

counter = 0
length = len(fileList)
for file in fileList:
    # if counter % 10 == 0:
    print('Processing File: ', str(file), ', COMPLETED: ', str(counter), '/', str(length))
    importData(str(file))
    counter += 1
print('FINISHED IMPORT OF ', str(length), ' FILES')

A few lines of the data I want to import:

example1@example.com:123456
example2@example.com:password!1

The error I get:

File "import.py", line 66, in <module>
    importData(str(file))
File "import.py", line 45, in importData
    for line in reader:
_csv.Error: line contains NULL byte

How should I handle lines that cannot be imported?

Thanks for your help

1 Answer:

Answer 0: (score: 0)

Your traceback shows that the exception originates in for line in reader:

File "import.py", line 45, in importData
    for line in reader:
_csv.Error: line contains NULL byte

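At that point you do not handle the exception. As the exception shows, it is raised by your csv reader instance. While you could certainly wrap the for loop in a try-except block, the loop would still end as soon as the exception is raised.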
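If you do want to keep going after a bad line, one option is to call next() on the reader yourself, so the try-except covers a single row instead of the whole loop. This is only a sketch: iter_rows_skipping_errors is a hypothetical helper name, and it assumes the reader can be resumed after raising csv.Error, which appears to hold for CPython's csv module but is worth verifying on your version.

```python
import csv


def iter_rows_skipping_errors(reader):
    """Yield rows from a csv reader, skipping rows that raise csv.Error.

    Hypothetical helper; assumes the reader can continue with the next
    input line after it has raised csv.Error.
    """
    while True:
        try:
            row = next(reader)
        except StopIteration:
            # Input exhausted: end the generator.
            return
        except csv.Error as exc:
            # Report the offending input line number and move on.
            print('SKIPPED LINE', reader.line_num, ':', exc)
            continue
        yield row
```

In importData you would then write for line in iter_rows_skipping_errors(reader): instead of iterating over reader directly.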

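This exception may be caused by the files being in an encoding different from that of your locale, which is what open() assumes when no encoding is given explicitly: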

"In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding."
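A minimal sketch of passing the encoding explicitly, assuming Python 3 (where open() accepts an encoding argument). The helper name read_rows is made up for illustration, and 'utf-16' in the usage line is only a guess at what the files might be; you would have to find out the real encoding yourself.

```python
import csv


def read_rows(path, encoding):
    """Read colon-delimited rows from `path`, decoding it with `encoding`.

    Hypothetical helper; the right value for `encoding` depends on how
    the files were written.
    """
    # newline='' is what the csv module documentation recommends for files
    # handed to csv.reader.
    with open(path, 'r', encoding=encoding, newline='') as f:
        return list(csv.reader(f, delimiter=':'))
```

For example, rows = read_rows('output/some_file.txt', 'utf-16'), where both the path and the encoding are placeholders.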

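The accepted answer in this Q&A outlines a solution to the problem, provided you can identify the correct encoding to open the files with. That Q&A also shows some ways of removing NULL bytes from a file before handing it over to the reader.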
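One way to drop the NULL bytes, sketched here under the assumption that simply discarding them is acceptable for your data, is to wrap the file in a generator that cleans each line before the csv reader sees it. The names strip_nul and read_rows_without_nul are hypothetical.

```python
import csv


def strip_nul(lines):
    """Yield each line with NUL characters removed."""
    for line in lines:
        yield line.replace('\0', '')


def read_rows_without_nul(f):
    """Parse colon-delimited rows from an open text file, ignoring NULs."""
    # csv.reader accepts any iterable of strings, not just file objects.
    return list(csv.reader(strip_nul(f), delimiter=':'))
```

Note that if the NULL bytes come from a wrong encoding (UTF-16 read as a single-byte encoding, for instance), stripping them only hides the problem; opening the file with the correct encoding is the cleaner fix.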

You may also want to skip empty lines rather than firing them at the database and handling the resulting exception, e.g.

for line in reader:
    if not line:
        continue
    try:
        [...]
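The same idea can be extended to rows that have no second field. The following is a sketch only: valid_rows is a hypothetical helper, and it assumes a usable row has at least two colon-separated fields, as in your sample data.

```python
import csv


def valid_rows(reader):
    """Yield (first, second) field pairs, skipping unusable rows."""
    for row in reader:
        if not row:
            # Blank line, such as the trailing newline at the end of a file.
            continue
        if len(row) < 2:
            # No delimiter found, so there is no second column to insert.
            print('SKIPPED MALFORMED LINE: ', row)
            continue
        yield row[0], row[1]
```

The insert loop then becomes for first, second in valid_rows(reader):, and only the database call itself needs a try-except. Be aware that a password containing ':' is split into more than two fields, so this sketch would keep only the part before the second colon.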