Web scraping with re: how do I ignore HTML tags inside the text I want to extract?

Time: 2014-08-04 11:42:30

Tags: python html web web-scraping

I want to print the text between <p><span class="step_leadin"> and </p> in the HTML source. I used the following code.

import urllib2
import re

url = 'http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/'
htmlfile = urllib2.urlopen(url)
htmltext = htmlfile.read()
regex2 = '<p><span class="step_leadin">(.+?)</p>'
pattern2 = re.compile(regex2)
method = re.findall(pattern2, htmltext)
print method

The part of the HTML I want to extract is:

<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>

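The problem is that when I run "print method", it gives me all the text between those two tags, including "</span>". I don't want </span> printed in the output. Is there a way to ignore the tags while extracting the text I want?

2 Answers:

Answer 0: (score: 1)

I strongly recommend that you don't use regular expressions to parse HTML, because HTML is not regular. Instead, use an HTML/XML parser such as BeautifulSoup or lxml. Here is an example of what you are trying to do, using BeautifulSoup: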

from bs4 import BeautifulSoup

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

# Name the parser explicitly so bs4 does not have to guess one
bs = BeautifulSoup(html, 'html.parser')

for p in bs.find_all('p'):
    print p.text
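
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a rough stand-in for BeautifulSoup, written in Python 3 syntax (unlike the Python 2 snippets in this thread); the class name ParagraphText is made up for illustration, and the sketch assumes <p> elements are not nested.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p>, ignoring any tags nested in it.

    ParagraphText is a hypothetical helper name, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._parts = None    # text pieces of the open <p>, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while a <p> is open.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self._parts is not None:
            self.paragraphs.append(''.join(self._parts))
            self._parts = None


# Sample markup copied from the question.
html = ('<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits '
        'to a rimmed baking sheet, spacing them an inch or so apart</p>')

parser = ParagraphText()
parser.feed(html)
parser.close()

for text in parser.paragraphs:
    print(text)
```

Because the parser reports tags and text as separate events, the <span> and </span> markers never reach the collected text, so no tag stripping is needed afterwards.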

Answer 1: (score: 0)

I think heinst's answer is better, but since you insist on using a regular expression, you can do this:

import re

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

print re.sub(r'<[^>]*?>', '', html)
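
To apply this to the page from the question rather than to a single hard-coded string, one option is to keep the question's capture pattern and strip the tags from each match afterwards. The following is a minimal sketch in Python 3 syntax (the other snippets in this thread are Python 2); the function name extract_steps and the re.DOTALL flag are my own additions, the latter on the assumption that a step may span several lines.

```python
import re

# Sample markup copied from the question; a real page would contain many such paragraphs.
html = ('<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits '
        'to a rimmed baking sheet, spacing them an inch or so apart</p>')

# Same capture as in the question: everything between the opening tags and </p>.
# re.DOTALL is an assumption so that a step containing newlines still matches.
STEP_RE = re.compile(r'<p><span class="step_leadin">(.+?)</p>', re.DOTALL)

# Same tag-stripping pattern as in the re.sub call above.
TAG_RE = re.compile(r'<[^>]*?>')


def extract_steps(markup):
    """Return each captured step with any inner tags (such as </span>) removed.

    extract_steps is a hypothetical helper name used only for this sketch.
    """
    return [TAG_RE.sub('', chunk) for chunk in STEP_RE.findall(markup)]


print(extract_steps(html))
```

Note that removing </span> leaves the lead-in and the sentence joined with no space between them ("Step1Carefully ..."); replacing the tags with a space instead of an empty string is one way around that.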