Question

I want to print the text that sits between a pair of tags in the HTML source. I used the following code.

import urllib2
import re
url = ['http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/']
htmlfile = urllib2.urlopen('http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/')
htmltext = htmlfile.read()
regex2 =  '<p><span class="step_leadin">(.+?)</p>'
pattern2 = re.compile(regex2)
method = re.findall(pattern2,htmltext)
print method

The part of the HTML I want to extract is:

<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>

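The problem is that when I run "print method", I get everything between the two tags, including the nested "</span>" tag, which I don't want in the output. Is there a way to ignore the markup while extracting the text I want?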

Answer 1

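I strongly recommend against using regular expressions to parse HTML, because HTML is not a regular language. Use an HTML/XML parser such as BeautifulSoup or lxml instead. Here is an example of what you are trying to do, using BeautifulSoup: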

from bs4 import BeautifulSoup

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

# Name the parser explicitly; bs4 warns when it has to guess one.
bs = BeautifulSoup(html, 'html.parser')

for p in bs.find_all('p'):
    print(p.text)  # .text joins all nested text and drops the tags
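
If installing a third-party package is not an option, roughly the same idea can be sketched with the `html.parser` module from the Python 3 standard library. This is only an illustration, not what BeautifulSoup does internally: the class name `StepTextExtractor` and its attributes are made up for this example, and it assumes the `<p>` elements are not nested.

```python
from html.parser import HTMLParser  # Python 3 standard library


class StepTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text of each <p>, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # text of every finished <p>
        self._parts = None    # text chunks of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._parts = []  # start collecting text for a new paragraph

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)  # text only; tags never reach this method

    def handle_endtag(self, tag):
        if tag == 'p' and self._parts is not None:
            self.paragraphs.append(''.join(self._parts))
            self._parts = None


html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

extractor = StepTextExtractor()
extractor.feed(html)
extractor.close()

for text in extractor.paragraphs:
    print(text)
```

Because the parser hands text to `handle_data` separately from the tags, the `<span>` markup never ends up in the collected strings.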

Answer 2

I believe heinst's answer is better, but since you insist on using regular expressions, you could do this:

import re

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

# Replace every <...> tag with an empty string, leaving only the text.
print(re.sub(r'<[^>]*?>', '', html))
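
To combine this with the `re.findall` call from the question, one possible sketch is to capture the lead-in label and the step text as separate groups, then strip any leftover tags from the step text. It assumes every step on the page uses exactly the markup shown in the question, so it will break if the page structure changes; the names `step_pattern`, `tag_pattern`, and `steps` are made up for this example.

```python
import re

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

# Group 1: the lead-in label inside the span. Group 2: the rest of the paragraph.
step_pattern = re.compile(
    r'<p><span class="step_leadin">(.+?)</span>(.+?)</p>', re.DOTALL)
# Matches any single tag such as <b> or </span>.
tag_pattern = re.compile(r'<[^>]*>')

steps = []
for leadin, body in step_pattern.findall(html):
    # Drop any tags still nested in the step text, then trim whitespace.
    steps.append((leadin, tag_pattern.sub('', body).strip()))

print(steps)
```

Keeping the label and the text in separate groups means the `</span>` tag is never part of a captured string in the first place.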

Web scraping with re: how do I ignore the HTML tags inside the text I want to extract?
