I want to print the text that sits between <p><span class="step_leadin"> and </p> in a page's HTML source. I used the following code:
import urllib2
import re
url = ['http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/']
htmlfile = urllib2.urlopen('http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/')
htmltext = htmlfile.read()
regex2 = '<p><span class="step_leadin">(.+?)</p>'
pattern2 = re.compile(regex2)
method = re.findall(pattern2,htmltext)
print method
The part of the HTML I want to extract is:
<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>
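The problem is that `print method` gives me everything between those two tags, including the "</span>". I don't want </span> to appear in the output. Is there a way to ignore the tags while extracting the text I want?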
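Answer 0 (score: 1)
I strongly advise against using regular expressions to parse HTML, because HTML is not regular. Use an HTML/XML parser such as BeautifulSoup or lxml instead. Here is an example of what you are trying to do, using BeautifulSoup: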
from bs4 import BeautifulSoup
html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'
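# Name the parser explicitly so bs4 does not warn and pick one on its own
bs = BeautifulSoup(html, 'html.parser')
for p in bs.find_all('p'):
    # .text joins all text nodes inside the <p> and drops the tags
    print p.text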
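If installing a third-party parser is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. This is only an illustration, written in Python 3 syntax (unlike the Python 2 code above). The class name `StepTextParser` and its attributes are made up for this example, and it assumes the `class` attribute is exactly `step_leadin` and that `<p>` elements are not nested.

```python
from html.parser import HTMLParser


class StepTextParser(HTMLParser):
    """Collect the text of every <p> that contains <span class="step_leadin">."""

    def __init__(self):
        super().__init__()
        self.in_p = False      # currently inside a <p> element
        self.is_step = False   # that <p> contains the step_leadin span
        self.parts = []        # text fragments of the current <p>
        self.steps = []        # finished step paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p, self.is_step, self.parts = True, False, []
        elif tag == 'span' and self.in_p and ('class', 'step_leadin') in attrs:
            self.is_step = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            if self.is_step:
                self.steps.append(''.join(self.parts))
            self.in_p = False

    def handle_data(self, data):
        # Only text nodes arrive here, so tags such as </span> never appear.
        if self.in_p:
            self.parts.append(data)


html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'
parser = StepTextParser()
parser.feed(html)
parser.close()
print(parser.steps)
```

Unlike looping over every `<p>`, this keeps only the paragraphs that actually hold a step, which is closer to what the question asks for when run against the full page.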
Answer 1 (score: 0)
I believe heinst's answer is better, but since you insist on using a regular expression, you can do this:
import re
html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'
print re.sub(r'<[^>]*?>', '', html)
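Note that the `re.sub` call above strips every tag from whatever string it is given, so on the full page it would return all of the page's text. A possible way to combine it with the pattern from the question is sketched below, in Python 3 syntax. The names `step_pattern` and `tag_pattern` and the extra intro paragraph in the sample are made up for illustration, and the sketch assumes the markup always has the exact shape shown in the question.

```python
import re

html = ('<p>Intro paragraph that should be skipped.</p>'
        '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits '
        'to a rimmed baking sheet, spacing them an inch or so apart</p>')

# Group 1 is the lead-in inside the span, group 2 is the rest of the paragraph.
# re.DOTALL lets "." match newlines in case a step spans several lines.
step_pattern = re.compile(
    r'<p><span class="step_leadin">(.*?)</span>(.*?)</p>', re.DOTALL)
# Matches any single tag, so leftover markup in the body can be removed.
tag_pattern = re.compile(r'<[^>]*>')

steps = [(lead, tag_pattern.sub('', body))
         for lead, body in step_pattern.findall(html)]
print(steps)
```

Because `</span>` sits between the two groups, it is never captured, and each result is a `(lead_in, text)` pair. This is still fragile: any change in attribute order, whitespace, or nesting breaks the pattern, which is why a parser is the safer choice.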