I'm currently practicing the basics of accessing the web with Python. I've been following a YouTube tutorial and got as far as the code below.
from urllib2 import urlopen, HTTPError
from BeautifulSoup import BeautifulSoup
import re
url="http://getbusinessreviews.org/"
try:
    webpage = urlopen(url).read
except HTTPError, e:
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
pathFinderTitle = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')
if webpage:
    if pathFinderTitle:
        findPathTitle = re.findall(pathFinderTitle,webpage)
    else:
        print "unable to get path finder title"
else:
    print "unable to url open "
listIterator =[]
listIterator[:]= range(2,10)
for i in listIterator:
    print findPathTitle[i]
I want to extract "Nutracoster" from the following HTML:
<h2 class="entry-title">
<a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>
</h2>
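I have two questions:
1. I'm not getting any results right now. Can anyone point out what I'm doing wrong? (I suspect my regular expression is not well defined.)
2. How can I pass this regular expression to BeautifulSoup?
Thanks in advance, and sorry for any silly mistakes; I'm still learning :D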
Answer 0 (score: 1)
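You don't need a regular expression to select elements with Beautiful Soup: it can extract all <h2> tags that have a given attribute on its own.
Also, it is best not to parse HTML with regular expressions (see this popular question).
Try this code: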
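To see why the pattern in the question finds nothing, here is a small offline sketch. It assumes the page's markup looks like the fragment quoted in the question, with the <h2> and <a> tags on separate lines. The `tolerant` pattern is a hypothetical variant added only for comparison, and the snippet is written for Python 3 (`print()` as a function). Note also that `urlopen(url).read` in the question is missing its call parentheses, so `webpage` ends up being a method object rather than the page text.

```python
import re

# The HTML fragment quoted in the question: <h2> and <a> sit on separate lines.
html = '''<h2 class="entry-title">
<a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>
</h2>'''

# The pattern from the question: it expects <a ...> to follow <h2 ...>
# with nothing in between, not even a newline.
strict = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')

# Hypothetical variant (not from the original post) that allows whitespace
# between the tags.
tolerant = re.compile(
    r'<h2 class="entry-title">\s*<a href.* rel="bookmark">(.*)</a>\s*</h2>'
)

strict_titles = strict.findall(html)
tolerant_titles = tolerant.findall(html)

print(strict_titles)
print(tolerant_titles)
```

Even the tolerant version would break as soon as the attribute order or spacing changes, which is why an HTML parser is the safer choice here.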
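from bs4 import BeautifulSoup as BS
from urllib2 import urlopen, HTTPError, URLError
url = "http://getbusinessreviews.org/"
try:
    # Pass the response object straight to Beautiful Soup; it reads it itself.
    webpage = urlopen(url)
except HTTPError, e:
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
except URLError, e:
    print e.args
    # Re-raise: without a response there is nothing to parse below.
    raise
soup = BS(webpage, 'lxml')
## Relevant lines ##
# Select every <h2> whose class attribute is "entry-title" and print its text.
for h2 in soup.find_all("h2", attrs={"class": "entry-title"}):
    print h2.text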
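If you want to try the same selection without installing anything or touching the network, the sketch below may help. It is not Beautiful Soup: it swaps in Python 3's standard-library `html.parser` and runs on the fragment from the question. The class name `EntryTitleParser` and its attributes are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class EntryTitleParser(HTMLParser):
    """Collects the text inside every <h2 class="entry-title"> element."""

    def __init__(self):
        super().__init__()
        self.inside_title = False  # True between <h2 class="entry-title"> and </h2>
        self.parts = []            # text fragments of the title being read
        self.titles = []           # finished titles, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "entry-title" in classes:
            self.inside_title = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "h2" and self.inside_title:
            # Join the fragments and drop the surrounding newlines.
            self.titles.append("".join(self.parts).strip())
            self.inside_title = False

    def handle_data(self, data):
        if self.inside_title:
            self.parts.append(data)


# The fragment quoted in the question.
html = '''<h2 class="entry-title">
<a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>
</h2>'''

parser = EntryTitleParser()
parser.feed(html)
parser.close()
titles = parser.titles
print(titles)
```

Unlike the regular expression, this does not care whether the tags are on one line or several, which is the same property that makes the Beautiful Soup version above robust.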