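I have a problem with my code. I've tried everything and it still doesn't work, so I thought I'd come to this community and try to get an answer.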
def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
The parse_html function returns a dictionary made up of the contents of certain fields in the index schema.
def pdftotext(pdf):
    """This code is very long, so I'm only posting the part where the
    error occurs."""
    data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
    with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
        outfile.write(data['text'])
    return data
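There is more code after outfile.write, but that part is fine. I'm trying to call parse_html inside the pdftotext function and then write the contents of the text field to a .txt file, and I get this error: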
<ipython-input-7-dc9e4ae8fd27> in pdftotext(pdf)
     37     data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
     38     with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
---> 39         outfile.write(data ['text'])   <----------- this is the error
     40
     41     os.remove(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 108: ordinal not in range(128)
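The traceback suggests that data['text'] is a unicode string containing the character ü (u'\xfc'), and that a file opened with plain open(..., 'w') in Python 2 falls back to the ASCII codec when it has to turn that string into bytes. One common way around this is to open the output file with an explicit encoding. The sketch below assumes UTF-8 is the encoding wanted for the .txt file; write_text, the temporary directory and the sample string are hypothetical stand-ins for illustration, not part of the original code.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text(path, text):
    """Write a unicode string to `path`, encoded as UTF-8.

    Hypothetical helper for illustration. io.open accepts an `encoding`
    argument on both Python 2 and Python 3, unlike the built-in open()
    in Python 2.
    """
    with io.open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(text)


# Stand-in for data['text']: a unicode string with a non-ASCII character.
sample_text = u'M\xfcller'

# Write to a throwaway directory instead of the real data1 folder.
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, 'example.txt')
write_text(out_path, sample_text)

# Read the file back, once as decoded text and once as raw bytes.
with io.open(out_path, encoding='utf-8') as infile:
    round_trip = infile.read()
with io.open(out_path, 'rb') as infile:
    raw_bytes = infile.read()
```

In pdftotext this would amount to replacing open(..., 'w') with io.open(..., 'w', encoding='utf-8') and leaving the outfile.write(data['text']) line alone. On Python 2 only, keeping the built-in open and writing data['text'].encode('utf-8') instead should have the same effect, again assuming data['text'] really is a unicode object.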