I am parsing a fixed-width file and I am having trouble with one particular string. The string looks like this:
(Pdb) record.description
'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'
The fixed-width file I am parsing looks like this:
LI 41000001009 Décision financière à long trem corrigé 14 00001100 0000000000 0000000000 00080000 000000 00000 00000 00000 00081 N 05062006 00000273 00 00000000 00000001 00000000 00000000 -------- 000005
The code that parses it and imports it into the database is here:
import struct, cStringIO, MySQLdb, glob, os, settings
from django.template.defaultfilters import slugify

cnv_text = lambda s: s.rstrip()

fieldspecs = [
    ('plu_number', 3, 15, cnv_text),
    ('description', 19, 80, cnv_text),
    ('price', 104, 8, cnv_text),
    ('member_price', 113, 8, cnv_text),
]

fieldspecs.sort(key=lambda x: x[1])

unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass

path = settings.PATH
files_to_delete = settings.GUTTER
for fname in glob.glob(path):
    with open(fname, 'r') as f:
        f = cStringIO.StringIO(f.read())
        for line in f:
            raw_fields = unpacker(line)
            record = Record()
            for x in field_indices:
                setattr(record, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
            db = MySQLdb.connect('localhost', settings.USER, settings.PASS, settings.DBNAME)
            cursor = db.cursor()
            fixed_member_price = int(record.member_price) / 100.0
            real_price = int(record.price) / 100.0
            try:
                cursor.execute(
                    "INSERT INTO catalog_product \
                    (name, slug, price, member_price, plu_number, description, old_price, is_active, is_featured, quantity, meta_description, image) \
                    VALUES \
                    ('%s', '%s', '%s', '%s', '%s', '%s', '00.00', false, false, 1, '', '/media/images/thumbnail-default.jpg')",
                    [record.description, slugify(record.description), str(real_price), str(fixed_member_price), record.plu_number, record.description]
                )
                db.commit()
            except:
                db.rollback()
            db.close()
for the_file in os.listdir(files_to_delete):
    file_path = os.path.join(files_to_delete, the_file)
    try:
        if os.path.isfile(file_path):
            os.unlink(file_path)
    except Exception, e:
        print e
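This code works for importing thousands of records at a time when they contain only ordinary strings, but as soon as a record contains special characters it fails to import. I think this is because the description field starts at column 19 and ends at 80; the special characters add extra characters past 80, and it errors out because it can no longer map the remaining fields. Does anyone know a way to keep the string in UTF-8 format, so that it does not try to import 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'?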
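To make the suspected mismatch concrete, here is a minimal sketch that compares the byte length and the character length of the description shown above. It assumes the file is UTF-8 encoded and that the column widths in the layout are meant as characters; the names `raw`, `text`, `byte_len`, `char_len` and `extra_bytes` are only illustrative.

```python
# -*- coding: utf-8 -*-
# The description exactly as pdb shows it: a UTF-8 encoded byte string.
raw = b'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'

# Decoding turns the bytes into text; each accented letter becomes one character.
text = raw.decode('utf-8')

byte_len = len(raw)    # what a byte-oriented struct format such as "80s" counts
char_len = len(text)   # what a layout defined in characters would count
extra_bytes = byte_len - char_len

print(byte_len, char_len, extra_bytes)
```

Each of the four accented letters takes two bytes in UTF-8, so a byte-based unpacker sees this description as four positions wider than it looks, which would shift every field that follows it on the line.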
Answer 0: (score: 2)
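That is a UTF-8 string. The \xc3\xa9-style escapes are simply how Python 2 displays the raw bytes; decoding them gives the expected text: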
>>> print 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'.decode('utf-8')
Décision financière à long trem corrigé
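Building on that, one possible approach is to decode each line before cutting it into fields, so that the positions count characters rather than bytes. The following is only a sketch under assumptions: that the file is UTF-8, and that the column positions from the question's `fieldspecs` are character positions. `FIELDSPECS` and `parse_line` are hypothetical names, and plain slicing is swapped in for `struct` because `struct` works on bytes.

```python
# -*- coding: utf-8 -*-
# (name, 1-based start column, width), copied from the question's fieldspecs.
# Assumption: these positions count characters, not bytes.
FIELDSPECS = [
    ('plu_number', 3, 15),
    ('description', 19, 80),
    ('price', 104, 8),
    ('member_price', 113, 8),
]


def parse_line(raw_line):
    """Decode one UTF-8 encoded line and slice it by character position."""
    text = raw_line.decode('utf-8')
    record = {}
    for name, start, width in FIELDSPECS:
        begin = start - 1  # convert the 1-based column to a 0-based index
        # Slice by characters and strip the trailing padding, like cnv_text.
        record[name] = text[begin:begin + width].rstrip()
    return record
```

The values returned are unicode text. If the database layer expects UTF-8 byte strings, each value can be turned back with `.encode('utf-8')` just before the insert.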