Possible duplicate:
Convert XML/HTML Entities into Unicode String in Python
I'm trying to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules.
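import re
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path (assumed)

response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
responseString = str(soup)
# Coarse pass: grab each <div class="sodatext"> block.
coarseExpression = re.compile('<div class="sodatext">[\n]*.*[\n]*</div>')
coarseResult = coarseExpression.findall(responseString)
# Fine pass: strip any remaining tags from each block.
fineExpression = re.compile('<[^>]*>')
fineResult = []
for coarse in coarseResult:
    fine = fineExpression.sub('', coarse)
    #print(fine)
    fineResult.append(fine)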
Unfortunately, characters such as apostrophes show up mangled, for example as &#x27;. Is there a way to avoid this? Or an easy way to replace them?
Answer 0: (score: 4)
The following BeautifulSoup documentation on entity conversion should be what you're looking for:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion
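For illustration, here is a minimal sketch of decoding the entities after the tags have been stripped. It does not use the BeautifulSoup option described at the link above; instead it swaps in `html.unescape` from the Python 3 standard library, so it assumes Python 3 rather than the Python 2 setup (urllib2) in the question. The helper name `strip_tags_and_unescape` and the sample string are hypothetical.

```python
import html
import re

# Same tag-stripping pattern as the question's fineExpression.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags_and_unescape(fragment):
    """Remove tags, then decode HTML entities such as &#x27; into characters.

    Hypothetical helper for illustration; assumes Python 3 (html.unescape).
    """
    text = TAG_PATTERN.sub("", fragment)
    return html.unescape(text)


# Hypothetical sample resembling one matched <div class="sodatext"> block.
sample = '<div class="sodatext">It&#x27;s a &quot;test&quot; &amp; more</div>'
result = strip_tags_and_unescape(sample)
print(result)
```

In the question's loop, the same call could be applied to each `fine` string before appending it to `fineResult`. Note that `html.unescape` decodes only one level, so text that was escaped twice would need a second pass.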