Python BeautifulSoup - Looping Through Multiple Pages

Asked: 2012-04-26 19:43:04

Tags: python web-scraping beautifulsoup

I'm trying to first scrape all the links from a page, then grab the URL of the "next" button and keep looping until there are no more pages. I've been trying to get a nested loop to do it, but for some reason BeautifulSoup never parses the second page.. only the first page, and then it stops..

It's hard to explain, but the code here should make it easier to understand what I'm trying to do :)

# this site holds the first page that it should start looping on.. from this page i want to reach page 2, 3, etc.
webpage = urlopen('www.first-page-with-urls-and-next-button.com').read()

soup = BeautifulSoup(webpage)

for tag in soup.findAll('a', { "class" : "next" }):

    print tag['href']
    print "\n--------------------\n"

    # next button is relative url so append it to main-url.com
    soup = BeautifulSoup('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))

    # for some reason this variable only holds the tag['href']
    print soup

    for taggen in soup.findAll('a', { "class" : "homepage target-blank" }):
        print tag['href']

        # Read page found
        sidan = urlopen(taggen['href']).read()

        # get title
        Titeln = re.findall(patFinderTitle, sidan)

        print Titeln

Any ideas? Sorry for the bad English, I hope I won't get flamed :) Just ask if I've explained something poorly and I'll do my best to clarify. Oh, and I'm new to Python - as of today (as you've probably already figured out).

2 Answers:

Answer 0 (score: 2):

I think you'll be all set if you call urlopen on the new URL and pass the resulting file object to BeautifulSoup. That is:

webpage = urlopen('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))
soup = BeautifulSoup(webpage)
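
To extend that fix into the loop the question describes (follow the "next" button until there are no more pages), a minimal sketch could look like the following. It assumes Python 2 with urllib2 and the BeautifulSoup 3 import style that was current at the time (with bs4 the import would be from bs4 import BeautifulSoup); the URLs are the placeholders from the question and patFinderTitle is a hypothetical title pattern, not real values:

import re
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

# Placeholder URLs from the question, plus a hypothetical title pattern.
base_url = 'http://www.main-url.com/'
url = 'http://www.first-page-with-urls-and-next-button.com'
patFinderTitle = re.compile(r'<title>(.*?)</title>')

while url:
    # Fetch the current page and parse the HTML, not the URL string.
    soup = BeautifulSoup(urlopen(url).read())

    # Scrape the links on this page and print each linked page's title.
    for taggen in soup.findAll('a', {"class": "homepage target-blank"}):
        sidan = urlopen(taggen['href']).read()
        print re.findall(patFinderTitle, sidan)

    # Follow the "next" button if there is one, otherwise stop looping.
    next_tag = soup.find('a', {"class": "next"})
    url = base_url + re.sub(r'\s', '', next_tag['href']) if next_tag else None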

Answer 1 (score: 0):

For this line:

soup = BeautifulSoup('http://www.main-url.com/'+ re.sub(r'\s', '', tag['href']))

try:

webpage = urlopen('http://www.main-url.com/'+re.sub(r'\s','',tag['href'])).read()

soup = BeautifulSoup(webpage)
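
A note on why this works: BeautifulSoup treats whatever string it is given as markup to parse, so BeautifulSoup('http://www.main-url.com/' + ...) just builds a document whose only content is that URL text, which is why the print in the question showed nothing but the href. Fetching the page with urlopen(...).read() first gives the parser actual HTML to work with.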