Question

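I'm trying to automate extracting data from a large number of files, and it works for the most part. It falls over when it encounters non-ASCII characters: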

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)

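How do I set 'brand' to UTF-8? My code is being repurposed from something else (which uses lxml) that had no such problems. I've seen a lot of discussion about encode/decode, but I don't understand how I'm supposed to implement it. The code below is cut down to just the relevant parts - I've removed the rest.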

i = 0

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

for i in range (len(filenames)):
    pathname = filenames[i]

    fin = open(pathname, 'r')
    with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>",brand_start)
        brand = lines [brand_start+47:brand_end-2]
        f.write(u'{}|{}\n'.format(pathname[4:35],brand))

flog.close()

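I'm sure there is a better way to write the whole thing, but for now my focus is just on understanding how to get the lines/read functions to work with UTF-8.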

Answer 1

You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing them with Unicode here:

f.write(u'{}|{}\n'.format(pathname[4:35],brand))

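brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both of your files: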

with io.open('Assets.log', 'w', encoding='utf-8') as f,\
     io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
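
If switching the input file to io.open() is not an option, the first suggestion (decoding brand where it is used) might look like the sketch below. The helper name format_row and the sample bytes are made up for illustration, and it assumes the XML files are UTF-8 encoded, which the 0xc5 byte in the traceback is consistent with but does not prove.

```python
def format_row(path, brand_bytes, encoding='utf-8'):
    """Decode the raw bytes explicitly, then build a Unicode row."""
    brand = brand_bytes.decode(encoding)
    return u'{}|{}\n'.format(path, brand)


# Hypothetical sample: UTF-8 bytes whose first byte is 0xc5.
raw_brand = b'\xc5\xa0koda'

row = format_row(u'some/path.xml', raw_brand)

# Decoding the same bytes as ASCII raises the same kind of error
# as the one reported in the question.
try:
    raw_brand.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The point of the sketch is that the decode step happens once, explicitly, with a known encoding, instead of being left to an implicit ASCII conversion.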

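You also appear to be parsing an XML file by hand; perhaps you want to use the ElementTree API to parse out those values instead. In that case, you'd open the file without io.open(), so that it produces byte strings and the XML parser can correctly decode the information to Unicode values for you.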
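
As a rough illustration of the ElementTree suggestion, here is a minimal sketch using the standard library's xml.etree.ElementTree. The element name Brand and the attribute Title are guesses based on the "Brand Title" search string in the question, and the sample document is invented; real files may use different names or XML namespaces.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, stored as UTF-8 bytes like a file on disk.
SAMPLE = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<Assets><Brand Title="\u0160koda"/></Assets>'
).encode('utf-8')


def brand_titles(binary_file):
    """Parse a binary file object and return every Brand Title attribute."""
    tree = ET.parse(binary_file)
    return [el.get('Title') for el in tree.getroot().iter('Brand')]


# io.BytesIO stands in for open(pathname, 'rb') here.
titles = brand_titles(io.BytesIO(SAMPLE))
```

Because the parser receives raw bytes, it reads the encoding from the XML declaration itself and hands back Unicode text, so no manual decode step or fixed slice offsets are needed.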

Answer 2

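Here is my final code, using the guidance above. It's not pretty, but it solves the problem. I'll look at getting it all working with lxml at a later date (as this is something I've run into before when working with different, larger XML files):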

import lxml
import io
import os

from lxml import etree
from glob import glob

nsmap = {'xmlns': 'thisnamespace'}

filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')

    for pathname in filenames:
        # lxml reads the file itself and decodes it to Unicode.
        parser = lxml.etree.XMLParser()
        tree = lxml.etree.parse(pathname, parser)
        root = tree.getroot()

        # Read the decoded text once per file; the with block closes it.
        with io.open(pathname, encoding='utf-8') as fin:
            lines = fin.read()

        for info in root.xpath('//somepath'):
            series_x = info.find('./somemorepath')
            series = series_x.get('Asset_Name') if series_x is not None else 'Missing'
            brand_start = lines.find(u"sometext")
            brand_end = lines.find(u"/>", brand_start)
            brand = lines[brand_start:brand_end - 2]
            brand = brand[brand.rfind(u"/") + 1:]
            f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))

Now someone will come along and do it all in one line!

Struggling with Unicode in Python
