Parsing fixed-width files that contain special characters?

Time: 2014-08-30 19:47:04

Tags: python django

I am parsing a fixed-width file and have run into a problem with one particular string. The string looks like this:

(Pdb) record.description
'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'

The fixed-width file I am parsing looks like this:

LI 41000001009 Décision financière à long trem corrigé 14 00001100 0000000000 0000000000 00080000 000000 00000 00000 00000 00081 N 05062006 00000273 00 00000000 00000001 00000000 00000000 -------- 000005

And here is the code that parses it and imports it into the database:

import struct, cStringIO, MySQLdb, glob, os, settings
from django.template.defaultfilters import slugify

cnv_text = lambda s: s.rstrip()

fieldspecs = [
    ('plu_number', 3, 15, cnv_text),
    ('description', 19, 80, cnv_text),
    ('price', 104, 8, cnv_text),
    ('member_price', 113, 8, cnv_text),
]

fieldspecs.sort(key=lambda x: x[1])

unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass

path = settings.PATH
files_to_delete = settings.GUTTER

for fname in glob.glob(path):
    with open(fname, 'r') as f:
        f = cStringIO.StringIO(f.read())
        for line in f:
            raw_fields = unpacker(line)
            record = Record()
            for x in field_indices:
                setattr(record, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))

            db = MySQLdb.connect('localhost', settings.USER, settings.PASS, settings.DBNAME)
            cursor = db.cursor()
            fixed_member_price = int(record.member_price) / 100.0
            real_price = int(record.price) / 100.0
            try:
                cursor.execute(
                    "INSERT INTO catalog_product \
                     (name, slug, price, member_price, plu_number, description, old_price, is_active, is_featured, quantity, meta_description, image) \
                     VALUES \
                     ('%s', '%s', '%s', '%s', '%s', '%s', '00.00', false, false, 1, '', '/media/images/thumbnail-default.jpg')",
                     [record.description, slugify(record.description), str(real_price), str(fixed_member_price), record.plu_number, record.description]
                )
                db.commit()
            except:
                db.rollback()
            db.close()
for the_file in os.listdir(files_to_delete):
    file_path = os.path.join(files_to_delete, the_file)
    try:
        if os.path.isfile(file_path):
            os.unlink(file_path)
    except Exception, e:
        print e

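This code works for importing thousands of records at a time when they contain plain strings, but any record with special characters fails to import. I think this is because the description field starts at column 19 and ends at 80; the special characters add extra bytes beyond that, so the remaining fields can no longer be mapped and the record errors out. Does anyone know how I can keep the string as UTF-8 text so that it doesn't try to import 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'?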
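
One way to check that hunch is to compare the byte length and the character length of the description. This is only a minimal sketch, assuming the value pdb printed is the UTF-8 byte string read from the file; the variable names are illustrative and not part of the import script:

```python
# The description exactly as pdb displayed it: a UTF-8 encoded byte string.
raw = b'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'

# Decoding turns the bytes into characters, which is what the column
# layout of the file is presumably based on.
text = raw.decode('utf-8')

byte_len = len(raw)           # what a byte-oriented unpacker such as struct counts
char_len = len(text)          # what a reader counting columns sees
extra = byte_len - char_len   # one extra byte per two-byte accented character

print(byte_len, char_len, extra)
```

Each of the four accented letters takes two bytes in UTF-8, so every field after the description would start four bytes later than the byte-based format string expects.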

1 Answer:

Answer 0: (score: 2)

That is a UTF-8 encoded string. Decoding it gives the text you expect:

>>> print 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'.decode('utf-8')
Décision financière à long trem corrigé
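
Building on that, one possible approach is to decode each line first and then cut the fields by character position instead of unpacking bytes with struct. The following is a sketch under stated assumptions, not the asker's actual fix: the field layout is copied from the question's fieldspecs, while `parse_line`, the sample line, and its price values are hypothetical and only there to make the example self-contained.

```python
# Field layout from the question: (name, 1-based start column, width in characters).
FIELDSPECS = [
    ('plu_number', 3, 15),
    ('description', 19, 80),
    ('price', 104, 8),
    ('member_price', 113, 8),
]


def parse_line(raw_line, encoding='utf-8'):
    """Decode one raw line, then slice the fields by character position."""
    text = raw_line.decode(encoding)
    record = {}
    for name, start, width in FIELDSPECS:
        begin = start - 1  # convert the 1-based column to a 0-based index
        record[name] = text[begin:begin + width].rstrip()
    return record


# Hypothetical sample line (made-up prices), padded by character count so
# that it matches the layout above.
description = u'D\xe9cision financi\xe8re \xe0 long trem corrig\xe9'
sample_text = (
    u'LI'                          # columns 1-2
    + u'41000001009'.ljust(15)     # columns 3-17: plu_number
    + u' '                         # column 18
    + description.ljust(80)        # columns 19-98: description
    + u' 14  '                     # columns 99-103: filler
    + u'00001100'                  # columns 104-111: price
    + u' '                         # column 112
    + u'00000990'                  # columns 113-120: member_price
)
raw_line = sample_text.encode('utf-8')  # the bytes as they would sit in the file

record = parse_line(raw_line)

# For contrast: slicing the undecoded bytes at the price columns lands four
# bytes too early, because the accented letters occupy two bytes each.
shifted = raw_line[103:111]

print(record['price'], record['member_price'])
```

In Python 2, lines read with open(fname, 'r') are already byte strings and can be passed to such a function directly; on Python 3 the file would need to be opened in binary mode ('rb') for the same to hold. The decoded description is a unicode object, so it still has to be encoded (or the connection configured for UTF-8) when it is written to the database.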