Python encoding problem with BeautifulSoup

Time: 2011-02-23 08:59:59

Tags: python encoding utf-8 ascii beautifulsoup

Hello, I have an encoding problem.

When I pass a string to BeautifulSoup, I lose all national characters:

addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
content = urllib2.urlopen(addr).read()
html_pag = BeautifulSoup(content)  # <- here I lose all national letters
table_html = html_pag.find("div", id="808")

At the top of the file I have:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re , sys
reload(sys)
sys.setdefaultencoding("utf-8")

2 answers:

Answer 0 (score: 4)

According to the BeautifulSoup documentation, all input is converted to Unicode internally:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'

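If your input does not declare an encoding (for example, in a meta tag), BeautifulSoup guesses. You can turn the guessing off by stating the input's encoding through the fromEncoding argument: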

soup = BeautifulSoup("hello", fromEncoding="UTF-8")
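To see why the declared input encoding matters, here is a minimal sketch that uses only the standard library (no BeautifulSoup involved). It decodes the same bytes once with the right codec and once with a wrong one. The sample string and the choice of ISO-8859-2 as the "wrong guess" are assumptions for illustration, not something taken from the page in the question.

```python
# Illustration only: the same bytes read through two different decoders.
text = u"Kapusta w\u0142oska"      # "Kapusta wloska" with a Polish l-stroke
raw = text.encode("utf-8")         # bytes as a UTF-8 server would send them

right = raw.decode("utf-8")        # correct codec: the text round-trips
wrong = raw.decode("iso-8859-2")   # wrong codec: no exception, but the
                                   # two-byte letter becomes two characters

print(right == text)
print(wrong == text)
print(len(text), len(wrong))
```

The wrong decode does not raise, which is why a bad guess tends to show up as damaged letters rather than as an error.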

Or is your real problem that the result looks "broken" when it is printed to the console?
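If it is the console, the letters can be lost at output time even though parsing was fine. A small sketch of that effect, assuming an ASCII-only terminal (the sample string is made up for the demonstration):

```python
text = u"Kapusta w\u0142oska"

# An ASCII-only console cannot represent U+0142, so strict encoding fails.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lenient encoder does not fail; it silently substitutes the letter.
replaced = text.encode("ascii", "replace")

# Encoding to UTF-8 keeps the letter as two bytes.
as_utf8 = text.encode("utf-8")

print(strict_ok)
print(repr(replaced))
print(repr(as_utf8))
```

So a "?" or a stripped character in the terminal does not by itself mean that the parsed data is wrong; checking repr() of the value is a quick way to tell the two cases apart.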

Answer 1 (score: 1)

Your code works perfectly:

>>> addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"                                
>>> content = urllib2.urlopen(addr) .read()
>>> html_pag = BeautifulSoup(content) #<- there i lost all national letters 
>>> table_html= html_pag.find("div",  id="808")
>>> print table_html.findAll('td')[8].string
Kapusta włoska

A few notes on this:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re , sys
reload(sys)
sys.setdefaultencoding("utf-8")

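reload re-imports a module. I'm not sure what you hope to achieve by reloading sys, but it won't buy you anything here: its only effect is to bring back sys.setdefaultencoding, which site.py removes at startup, and BeautifulSoup does not need that to decode the page.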
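As an alternative to changing the process-wide default, you can name the encoding explicitly at each I/O boundary. This is only a sketch of that idea, not a fix for the code in the question; the temporary file and the sample string are hypothetical and exist only for the demonstration.

```python
import io
import os
import tempfile

text = u"Kapusta w\u0142oska"

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write: the encoding is stated where the text leaves the program.
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    # Read back as text, again naming the encoding.
    with io.open(path, "r", encoding="utf-8") as fh:
        round_tripped = fh.read()
    # Read back as raw bytes to see what was actually stored.
    with io.open(path, "rb") as fh:
        stored = fh.read()
finally:
    os.remove(path)

print(round_tripped == text)
print(repr(stored))
```

io.open accepts the encoding argument in both Python 2.7 and Python 3, so no default needs to be changed for this to work.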