How do I get rid of &#x27; characters that appear instead of apostrophes?

Asked: 2011-12-22 17:50:30

Tags: python regex screen-scraping web-scraping beautifulsoup

  

Possible duplicate:
  Convert XML/HTML Entities into Unicode String in Python

I am trying to scrape a website with Python. I import and use the urllib2, BeautifulSoup and re modules.

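import re
import urllib2

# BeautifulSoup 3 import; with bs4 this would be `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup

# `url` is assumed to hold the address of the page being scraped
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
responseString = str(soup)

# Coarse pass: grab each <div class="sodatext"> block
coarseExpression = re.compile('<div class="sodatext">[\n]*.*[\n]*</div>')
coarseResult = coarseExpression.findall(responseString)

# Fine pass: strip any remaining tags from each block
fineExpression = re.compile('<[^>]*>')
fineResult = []

for coarse in coarseResult:
    fine = fineExpression.sub('', coarse)
    #print(fine)
    fineResult.append(fine)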

Unfortunately, characters such as apostrophes come out garbled, as &#x27;. Is there a way to avoid this, or an easy way to replace them?

1 answer:

Answer 0 (score: 4)

The following BeautifulSoup documentation on entity conversion should be what you are looking for:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion
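
The linked page covers BeautifulSoup 3's own entity handling (the `convertEntities` constructor argument). As an alternative that does not depend on the parser, here is a minimal sketch that decodes entities after the tags have been stripped. It assumes Python 3, where the standard library provides `html.unescape`; the function name `strip_tags_and_decode` and the sample string are made up for illustration and are not part of the question's code.

```python
import html
import re

# Same tag-stripping pattern as fineExpression in the question
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags_and_decode(fragment):
    """Remove HTML tags, then turn entities such as &#x27; into characters."""
    # Strip tags first, so that a decoded "&lt;" is never mistaken for a tag
    text = TAG_PATTERN.sub("", fragment)
    return html.unescape(text)


# Hypothetical fragment shaped like the divs matched in the question
sample = '<div class="sodatext">\nIt&#x27;s a &quot;test&quot; &amp; more\n</div>'
print(strip_tags_and_decode(sample).strip())
```

In the question's loop this would amount to calling the decoder on each `fine` string before appending it. On Python 2, which the question's `urllib2` import implies, `html.unescape` is not available, so the BeautifulSoup option from the linked documentation is probably the more direct route.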