Question

I have a directory of XML files, where each file has the following format:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Brand</word>
            <lemma>brand</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>5</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
          </token>
          <token id="2">
            <word>Blogs</word>
            <lemma>blog</lemma>
            <CharacterOffsetBegin>6</CharacterOffsetBegin>
            <CharacterOffsetEnd>11</CharacterOffsetEnd>
            <POS>NNS</POS>
            <NER>O</NER>
          </token>
          <token id="3">
            <word>Capture</word>
            <lemma>capture</lemma>
            <CharacterOffsetBegin>12</CharacterOffsetBegin>
            <CharacterOffsetEnd>19</CharacterOffsetEnd>
            <POS>VBP</POS>
            <NER>O</NER>
          </token>

I parse each XML file, collect the text between the <word> tags, and then find the 100 most frequent words.

This is what I do:

import os
from collections import Counter

from bs4 import BeautifulSoup

def find_top_words(xml_directory):
    file_list = []
    temp_list=[]
    file_list2=[]
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page,"xml")
                for word in soup.find_all('word'):
                    file_list.append(str(word.string.strip()))
            f.close()
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    for w in sorted(counts, key=counts.get, reverse=True):
          temp_list.append(w)
    return temp_list[:100]

However, I get this error:

File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)

What does it mean, and how do I fix it?

Answer 1

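I would avoid BeautifulSoup here (the old BeautifulSoup 3 line is deprecated). Why not use the standard library? If you need more sophisticated XML processing there is lxml (though I'm fairly sure you don't).

The standard library can handle this easily.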
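As a rough illustration of the standard-library route, here is a minimal sketch using xml.etree.ElementTree. The SAMPLE string and the helper name top_words_from_xml are made up for this example; the sample only mimics the CoreNLP layout shown in the question, and the code has not been run against the asker's real files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample that mirrors the <word> layout from the question.
SAMPLE = u"""<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Brand</word></token>
<token id="2"><word>na\u00efve</word></token>
<token id="3"><word>brand</word></token>
</tokens></sentence></sentences></document></root>"""


def top_words_from_xml(xml_bytes, limit=100):
    """Return up to `limit` lowercased <word> values, most frequent first."""
    root = ET.fromstring(xml_bytes)
    counts = Counter()
    for node in root.iter('word'):
        if node.text:  # skip empty <word/> elements
            counts[node.text.strip().lower()] += 1
    return [word for word, _ in counts.most_common(limit)]


top = top_words_from_xml(SAMPLE.encode('utf-8'))
```

For files on disk, ET.parse(path).getroot() would take the place of ET.fromstring. The text values stay as text the whole way through, so no str() call is involved.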

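EDIT: Ignore my earlier answer, it was wrong -_- Your actual problem is that in Python 2, calling str(my_string) on a unicode string that contains non-ASCII characters implicitly tries to encode it as ASCII. Call my_string.encode('utf-8') instead.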
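To make the failure concrete, the sketch below performs the ASCII encoding explicitly, which is the step Python 2's str() attempts behind the scenes on a unicode object. The variable word is a hypothetical token, not data from the asker's files.

```python
# Hypothetical token containing a non-ASCII character (U+00EF).
word = u'na\u00efve'

# Encoding to ASCII fails, because U+00EF is outside the 0-127 range.
try:
    word.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_bytes = word.encode('utf-8')
```

The explicit form also behaves the same way on Python 3, which makes it handy for reproducing the error outside the original script.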

Answer 2

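In Python 2, str() encodes a unicode string with the ascii codec. Somewhere in your XML files, word.string.strip() returns text containing non-ASCII characters, which is why you get this error. The solution is to use: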

file_list.append(word.string.strip().encode('utf-8'))

To turn these values back into unicode text (for example, when printing them), decode them:

for item in file_list:
    print item.decode('utf-8')
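The encode/decode round trip described here can be checked in isolation. This is only a sketch with hypothetical tokens; it shows that decoding with the same codec returns the original text unchanged.

```python
# Hypothetical tokens, one of them containing a non-ASCII character.
words = [u'Brand', u'na\u00efve']

# Encode each unicode string to UTF-8 bytes, as in the append() call.
encoded = [w.encode('utf-8') for w in words]

# Decode with the same codec to get the original text back.
decoded = [b.decode('utf-8') for b in encoded]
```

Decoding with a different codec than the one used for encoding is the usual way this round trip goes wrong, so it helps to keep both calls on 'utf-8'.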

Hope this helps.

Answer 3

In this line of code:

file_list.append(str(word.string.strip()))

Why use str at all? The data is already Unicode, and you can append unicode strings to the list directly. If you need byte strings, use word.string.strip().encode('utf8') instead.
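A minimal sketch of the counting step with the str() call dropped, so the words stay as unicode text from start to finish. The helper name top_words and the sample list are assumptions for illustration; in the asker's function the list would come from soup.find_all('word').

```python
from collections import Counter


def top_words(words, limit=100):
    """Return up to `limit` lowercased words, most frequent first."""
    counts = Counter(w.strip().lower() for w in words)
    return [w for w, _ in counts.most_common(limit)]


# Hypothetical tokens; no str() conversion is applied anywhere.
sample = [u'Brand', u'brand', u'Na\u00efve', u'na\u00efve', u'Blogs', u'brand']
result = top_words(sample, limit=2)
```

Counter.most_common also replaces the manual sorted(...) loop and the final slice from the question.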

Unicode error when parsing XML files
