Hello, I have an encoding problem: when I pass a string to BeautifulSoup, all national (non-ASCII) characters are lost.
addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
content = urllib2.urlopen(addr).read()
html_pag = BeautifulSoup(content)  # <- national letters are lost here
table_html = html_pag.find("div", id="808")
In the file header I have:
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re , sys
reload(sys)
sys.setdefaultencoding("utf-8")
Answer 0 (score: 4)
According to the BeautifulSoup documentation, all input is converted to Unicode internally:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'
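If your input does not declare its encoding (for example, in a meta tag), BeautifulSoup guesses. You can disable the guessing by specifying the input's encoding with the fromEncoding parameter: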
soup = BeautifulSoup("hello", fromEncoding="UTF-8")
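To illustrate why the encoding guess matters, here is a minimal sketch that uses only plain byte strings, not BeautifulSoup. It assumes, purely for illustration, that the page was served as ISO-8859-2; the real encoding of the site in the question is not known from this thread. The helper name decode_with is hypothetical.

```python
# -*- coding: utf-8 -*-
# Illustration only: plain byte/text handling, no BeautifulSoup involved.

def decode_with(raw, encoding):
    """Decode raw bytes using the given encoding (hypothetical helper)."""
    return raw.decode(encoding)

text = u"Kapusta w\u0142oska"       # the table cell from the question
raw = text.encode("iso-8859-2")      # assumed wire encoding, for illustration

right = decode_with(raw, "iso-8859-2")  # correct encoding restores the text
wrong = decode_with(raw, "latin-1")     # wrong guess: no error, wrong letter

print(right == text)
print(wrong == text)
```

A wrong guess does not raise an error here; the byte for the Polish letter is simply mapped to a different character, which is why stating the encoding explicitly can help when guessing fails.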
Or is your real problem that the result looks 'broken' when printed to the console?
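If the console is the culprit, one way to check is to encode the text explicitly for the terminal instead of relying on an implicit conversion. The sketch below is an assumption-laden illustration: encode_for_console is a hypothetical helper, and the fallback to ASCII when the stream reports no encoding is a choice made here, not something taken from the thread.

```python
# -*- coding: utf-8 -*-
import sys

def encode_for_console(text, encoding=None):
    """Encode text for a terminal, replacing characters it cannot represent.

    Hypothetical helper: falls back to ASCII if no encoding is known.
    """
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(enc, "replace")

cell = u"Kapusta w\u0142oska"

# A UTF-8 terminal can represent every character.
print(repr(encode_for_console(cell, "utf-8")))

# An ASCII-only terminal cannot; the unsupported letter becomes '?'.
print(repr(encode_for_console(cell, "ascii")))
```

If the second form matches what you see, the parsed data is probably intact and only the display step is losing characters.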
Answer 1 (score: 1)
Your code works perfectly:
>>> addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
>>> content = urllib2.urlopen(addr) .read()
>>> html_pag = BeautifulSoup(content) #<- there i lost all national letters
>>> table_html= html_pag.find("div", id="808")
>>> print table_html.findAll('td')[8].string
Kapusta włoska
A few notes on this:
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re , sys
reload(sys)
sys.setdefaultencoding("utf-8")
reload reloads a module. I'm not sure what you hope to achieve by reloading sys, but it won't buy you anything.
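For context: in Python 2, sys.setdefaultencoding is removed from sys at startup, and reload(sys) is the usual trick for getting it back. That trick is widely discouraged. A commonly suggested alternative is to decode bytes explicitly where they enter the program. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the header string would come from the HTTP response, and falling back to UTF-8 is an arbitrary default chosen for the example.

```python
# -*- coding: utf-8 -*-
import re

def charset_from_content_type(header, default="utf-8"):
    """Extract the charset from a Content-Type header value.

    Hypothetical helper; returns `default` when no charset is declared.
    """
    match = re.search(r"charset=[\"']?([\w-]+)", header or "", re.I)
    return match.group(1).lower() if match else default

def decode_body(raw, content_type):
    """Decode response bytes using the declared charset (hypothetical helper)."""
    return raw.decode(charset_from_content_type(content_type), "replace")

# Example with made-up header values and a body encoded as ISO-8859-2.
body = u"Kapusta w\u0142oska".encode("iso-8859-2")
print(charset_from_content_type("text/html; charset=ISO-8859-2"))
print(decode_body(body, "text/html; charset=ISO-8859-2") == u"Kapusta w\u0142oska")
```

Decoding at the boundary keeps the rest of the program working with text only, so no interpreter-wide default has to be changed.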