I have the following input XML file. I read the rel_notes tag and print its text, but I run into the error below.
Input XML:
<rel_notes>
• Please move to this build for all further test and development activities
• Please use this as base build to verify compilation and sanity before any check-in happens
</rel_notes>
Sample Python code:
file = open('data.xml', 'r')
from xml.etree import cElementTree as etree
tree = etree.parse(file)
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
Output:
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
File "C:\python2.7.3\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 9: character maps to <undefined>
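Answer 0 (score: 1)
The problem is printing Unicode to the Windows console. Namely, the character '•' (U+2022) cannot be represented in cp437, the encoding your console uses.
To reproduce the problem, try: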
print u'\u2022'
You can set the PYTHONIOENCODING environment variable to instruct Python to replace every unrepresentable character with the corresponding XML character reference:
T:\> set PYTHONIOENCODING=cp437:xmlcharrefreplace
T:\> python your_script.py
Or encode the text to bytes yourself before printing:
print u'\u2022'.encode('cp437', 'xmlcharrefreplace')
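To make the effect of the error handler concrete, here is a minimal sketch. It assumes Python 3 (the snippets above use Python 2 syntax), and the helper name `encode_for_console` is hypothetical, introduced only for illustration.

```python
def encode_for_console(text, encoding="cp437", errors="xmlcharrefreplace"):
    """Encode text for a console using `encoding`; `errors` decides what
    happens to characters the encoding cannot represent."""
    return text.encode(encoding, errors)


bullet = u"\u2022 Please move to this build"

# The default (strict) handler reproduces the failure from the question.
try:
    bullet.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'xmlcharrefreplace' substitutes an XML character reference,
# 'replace' substitutes a plain question mark.
as_charref = encode_for_console(bullet)
as_question_mark = encode_for_console(bullet, errors="replace")

print(strict_failed)
print(as_charref)
print(as_question_mark)
```

Which handler to pick depends on whether you want to be able to recover the original character from the output later; the character reference keeps that information, the question mark does not.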
To answer your initial question
To print the text of each <build_location/> element:
import sys
from xml.etree import cElementTree as etree
input_file = sys.stdin # filename or file object
tree = etree.parse(input_file)
print('\n'.join(elem.text for elem in tree.iter('build_location')))
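The same pattern applies to the rel_notes element from the question. The following is only a sketch under a few assumptions: Python 3, `xml.etree.ElementTree` instead of `cElementTree` (which was removed in Python 3.9), an in-memory sample document in place of data.xml, and a hypothetical helper named `collect_text`.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample standing in for data.xml, stored as UTF-8 bytes.
SAMPLE_XML = u"""<root>
<rel_notes>
\u2022 Please move to this build
</rel_notes>
</root>""".encode("utf-8")


def collect_text(source, tag):
    """Return the text of every `tag` element found in `source`."""
    tree = etree.parse(source)
    # elem.text is None for empty elements, so fall back to "".
    return [elem.text or "" for elem in tree.iter(tag)]


notes = collect_text(io.BytesIO(SAMPLE_XML), "rel_notes")

# Round-trip through cp437 so that the result is safe to print on a
# cp437 console: unsupported characters become character references.
safe = "\n".join(notes).encode("cp437", "xmlcharrefreplace").decode("cp437")
print(safe)
```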
If the input file is large, iterparse() could be used:
import sys
from xml.etree import cElementTree as etree
input_file = sys.stdin
context = iter(etree.iterparse(input_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
    if event == 'end' and elem.tag == 'build_location':
        print(elem.text)
        root.clear() # free memory
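For experimenting without a large file, the streaming loop can be wrapped in a generator. This is a sketch assuming Python 3 and `xml.etree.ElementTree`; the sample document, its paths, and the name `iter_tag_text` are made up for illustration.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample input; a real file would be far larger.
SAMPLE_XML = b"""<builds>
<build><build_location>/builds/1001</build_location></build>
<build><build_location>/builds/1002</build_location></build>
</builds>"""


def iter_tag_text(source, tag):
    """Yield the text of each `tag` element while parsing incrementally."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            root.clear()  # drop already-processed children to free memory


locations = list(iter_tag_text(io.BytesIO(SAMPLE_XML), "build_location"))
print(locations)
```

Because the function is a generator, the caller can process each value as it arrives instead of collecting everything in a list as done here.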
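Answer 1 (score: 0)
I don't think the entire snippet above is all that useful. However, a UnicodeEncodeError usually occurs when non-ASCII characters are not handled properly.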
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
It is explained clearly in this answer: Python: Convert Unicode to ASCII without errors
This should at least resolve the UnicodeEncodeError.
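A concrete version of the decode-then-encode idea might look like the sketch below. It assumes Python 3 and that the source bytes are UTF-8; the variable `raw` is a made-up stand-in for `html`, and the `'ignore'` handler is one common way to get ASCII output without errors, not necessarily the only one.

```python
# Hypothetical input: UTF-8 bytes containing the bullet character U+2022.
raw = b"\xe2\x80\xa2 Please move to this build"

# Decode with the source encoding first (assumed to be UTF-8 here) ...
unicode_str = raw.decode("utf-8")

# ... then encode to the target encoding explicitly.
encoded_str = unicode_str.encode("utf8")

# For ASCII-only output, 'ignore' silently drops unsupported characters.
ascii_only = unicode_str.encode("ascii", "ignore")

print(encoded_str)
print(ascii_only)
```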