Unable to define a regular expression with re.compile and pass it to BeautifulSoup

Time: 2015-11-22 13:55:00

Tags: regex python-2.7 beautifulsoup

I am currently practicing the basics of accessing the web with Python. I was following a tutorial on YouTube and got as far as the code below.

from urllib2 import urlopen,  HTTPError
from BeautifulSoup import BeautifulSoup
import re


url="http://getbusinessreviews.org/"
try:
   webpage = urlopen(url).read
except HTTPError, e:  
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
pathFinderTitle = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')
if  webpage:
    if pathFinderTitle:
        findPathTitle = re.findall(pathFinderTitle,webpage)
    else:
        print "unable to get path finder title"

else:
    print "unable to url open "
listIterator =[]
listIterator[:]= range(2,10)

for i in listIterator:
    print findPathTitle[i]

I want to extract "Nutracoster" from the following HTML:

        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>

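I have two questions:

  1. I currently get no results. Can anyone point out what I am doing wrong? (I suspect my regular expression is not defined correctly.)

  2. How can I pass this regular expression to BeautifulSoup?

Thanks in advance, and sorry for any silly mistakes, as I am still learning :D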

1 Answer:

Answer 0 (score: 1)

You don't need a regular expression to select elements with Beautiful Soup: it can extract all <h2> tags that carry a given attribute on its own.

Also, it is best not to parse HTML with regular expressions (see this popular question).
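As for why the original snippet returns nothing, two things seem to be at play. First, `urlopen(url).read` lacks parentheses, so `webpage` holds the method itself rather than the page text. Second, `.` does not match newlines by default, while the target HTML spreads the tags over several lines. The sketch below reproduces the second problem using only the `re` module and the HTML fragment from the question; the `adjusted` pattern is an illustrative assumption, not a recommendation to parse HTML this way.

```python
import re

# The HTML fragment from the question, including its line breaks.
html = """
        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>
"""

# Pattern from the question: it expects <h2> and <a> to be directly adjacent,
# and "." never crosses a newline, so it cannot match the fragment above.
original = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')

# Hypothetical adjusted pattern: \s* tolerates whitespace (including newlines)
# between the tags, and the non-greedy group stops at the first </a>.
adjusted = re.compile(
    r'<h2 class="entry-title">\s*<a href="[^"]*" rel="bookmark">(.*?)</a>\s*</h2>'
)

original_matches = original.findall(html)
adjusted_matches = adjusted.findall(html)

print(original_matches)  # []
print(adjusted_matches)  # ['Nutracoster']
```

Even the adjusted pattern breaks as soon as the attribute order or quoting changes, which is why a parser is the sturdier choice.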

Try this code instead:

from bs4 import BeautifulSoup as BS
from urllib2 import urlopen, HTTPError, URLError

url = "http://getbusinessreviews.org/"
try:
    webpage = urlopen(url)
except HTTPError, e:
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
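except URLError, e:
    # No response object exists at this point (e.g. DNS failure),
    # so report the error and re-raise instead of parsing an undefined name
    print e.args
    raise

soup = BS(webpage, 'lxml')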

## Relevant lines ##
for h2 in soup.find_all("h2", attrs={"class": "entry-title"}):
    print h2.text
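On the second question: Beautiful Soup accepts a compiled pattern wherever it accepts a string filter, so the regular expression is applied to attribute values rather than to raw HTML. A minimal sketch, assuming the HTML fragment from the question and the built-in `html.parser` (so lxml is not required); both patterns are illustrative choices, not something taken from the site.

```python
import re
from bs4 import BeautifulSoup

# The HTML fragment from the question.
html = """
        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>
"""

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern can stand in for an attribute value.
# Here it keeps only <a> tags whose href ends in "/nutracoster/".
links = soup.find_all("a", href=re.compile(r"/nutracoster/?$"))
titles = [a.get_text(strip=True) for a in links]

# The same works for the class attribute: keep <h2> tags that have
# a CSS class starting with "entry-".
headings = soup.find_all("h2", class_=re.compile(r"^entry-"))
heading_titles = [h2.get_text(strip=True) for h2 in headings]

print(titles)
print(heading_titles)
```

Beautiful Soup tests the pattern with `search()`, so anchors such as `^` and `$` are needed whenever the whole attribute value should match.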