My script works when I download an English Bible, but when I download a Bible in a foreign language, it fails with an ASCII error.
Python
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and converting Bibles to Aurora...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    namesave = '%s.html' % '.'.join(name.split('.')[:-1])
    chnum = name.split('.')[-2]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    try:
        f = urllib2.urlopen(url)
    except urllib2.URLError:
        print "Bad URL or timeout"
        continue
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    thearticle = soup.html.body.article
    bookname = thearticle['data-book-human']
    soup.html.replaceWith('<html>'+str(bookname)+'</html>')
    converted = str(soup)
    full_path = os.path.join(dirname, namesave)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS AND CONVERSIONS COMPLETE!")
A links.html that works:
<a href="http://www.youversion.com/bible/john.6.ceb">http://www.youversion.com/bible/john.6.ceb</a>
A links.html that produces the error:
<a href="http://www.youversion.com/bible/john.6.nav">http://www.youversion.com/bible/john.6.nav</a>
The error:
File "test.py", line 32, in <module>
soup.html.replaceWith('<html>'+str(bookname)+'</html>')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
Answer 0 (score: 2)
I have seen a similar error before, possibly even the same one. I don't remember exactly.
Try:
BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
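Or try forcing unicode, since in Python 2 calling str() on a unicode value implicitly encodes it with the ASCII codec: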
soup.html.replaceWith(u'<html>'+unicode(bookname)+u'</html>')
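To make the second suggestion concrete, here is a minimal sketch of the idea: keep the text as unicode while building the markup, and encode only when writing the file. It does not use BeautifulSoup or the network. The `bookname` value is a hypothetical stand-in for `thearticle['data-book-human']`, the helper names are made up for illustration, and UTF-8 as the output encoding is an assumption. The `__future__` imports are there so the same code runs on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

# Hypothetical non-ASCII book name standing in for thearticle['data-book-human'].
bookname = "\u064a\u0648\u062d\u0646\u0627"


def ascii_encoding_fails(text):
    """Return True if text cannot be encoded with the ASCII codec."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def build_page(name):
    """Build the replacement markup as unicode text, without calling str()."""
    return "<html>" + name + "</html>"


def save_page(path, page):
    """Write the page to disk, encoding to UTF-8 only at the output boundary."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(page)


page = build_page(bookname)
out_path = os.path.join(tempfile.mkdtemp(), "john.6.html")
save_page(out_path, page)

# Read the file back to confirm that nothing was lost on the way out.
with io.open(out_path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

print(ascii_encoding_fails(bookname))
print(round_trip == page)
```

In the original script the equivalent change would probably be to avoid `str(soup)` as well and to encode the final unicode text explicitly before writing it, because `str()` is where Python 2 falls back to ASCII.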