How to extract content from an HTML string

Asked: 2013-05-17 06:22:20

Tags: python scrapy

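I want to extract the content of a DIV tag. I am using scrapy to scrape some websites, but the problem is that the same DIV tag can hold two kinds of content: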

["<div class=\"price\">\n                <s>Rs.330</s> <b>Rs.297</b>\n                              </div>"]

and

["<div class=\"price\">\n                Rs.330              \n</div>"] 

How can I extract the content from this tag in both cases?

1 answer:

Answer 0: (score: 2)

Use BeautifulSoup:

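import bs4

# Sample markup: the first variant, with a struck-through price and a bold price.
html = "<div class=\"price\">\n                <s>Rs.330</s> <b>Rs.297</b>\n                              </div>"
# "html.parser" is the parser bundled with Python; features="xml" would require lxml.
soup = bs4.BeautifulSoup(html, features="html.parser")
s = soup.div.s.text  # 'Rs.330' (text of the <s> element)
b = soup.div.b.text  # 'Rs.297' (text of the <b> element)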
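
Note that this only covers the first variant. For the second one there is no <s> or <b> child, so soup.div.s is None and accessing .text raises AttributeError. Below is a minimal sketch that handles both cases. The helper name extract_prices and the (struck_price, shown_price) return shape are my own assumptions for illustration, not something from the question, so adapt them to your spider as needed.

```python
import bs4


def extract_prices(html):
    """Return (struck_price, shown_price) for a <div class="price"> fragment.

    Hypothetical helper: struck_price is None when the div holds plain text only.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="price")
    if div is None:
        # No price div in this fragment at all.
        return None, None

    struck = div.find("s")
    bold = div.find("b")
    if struck is not None and bold is not None:
        # First variant: <s>old</s> <b>new</b>
        return struck.get_text(strip=True), bold.get_text(strip=True)

    # Second variant: the div contains only text, so strip the padding.
    return None, div.get_text(strip=True)


with_children = "<div class=\"price\">\n  <s>Rs.330</s> <b>Rs.297</b>\n  </div>"
plain_text = "<div class=\"price\">\n  Rs.330  \n</div>"

print(extract_prices(with_children))
print(extract_prices(plain_text))
```

Since scrapy's extract() gives you a list of strings like the ones in the question, you would call this on each element of that list.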