I am writing a scraper to extract recipe href links from the Wikibooks Cookbook. Given my implementation below, how do I keep appending links until I reach the last page? Note: the next-page link is titled "next 200".
The links are listed here: http://en.wikibooks.org/wiki/Category:Recipes
def fetch_links(self, proxy):
    """Extracts filtered recipe href links.

    Args:
        proxy: The configured proxy address.

    Raises:
        ValueError: If proxy is not a valid address.
    """
    if not self._valid_proxy(proxy):
        raise ValueError('invalid proxy address: {}'.format(proxy))
    self.browser.set_proxies({'http': proxy})
    page = self.browser.open(self.wiki_recipes)
    html = page.read()
    link_tags = SoupStrainer('a', href=True)
    soup = BeautifulSoup(html, parse_only=link_tags)
    recipe_hrefs = r'^\/wiki\/Cookbook:(?!recipes|table_of_contents).*$'
    return [link['href'] for link in soup.find_all(
        'a', href=re.compile(recipe_hrefs, re.IGNORECASE))]
Answer 0 (score: 0)
Following the approach I suggested in the comments, here is a code sample using urllib and re; the same technique can be reused in your own code.
Create a function that takes a URL as its argument. Pass in the start URL first, scrape all the recipe links with a regex, and append them to a global list. Then take the "next 200" link as the argument and call the same function again. A try/except catches the point where there is no next link left, and the list is printed out.
Since your class code is not shown, I will skip all the class and proxy parts:
#!/usr/bin/python
import urllib
import re

base_url = 'http://en.wikibooks.org/wiki/Category:Recipes'
next_base = 'http://en.wikibooks.org'

recipes = []


# this is just the sample function
# you should handle your proxy logic here too
def get_links(url):
    request = urllib.urlopen(url)
    content = request.read()

    # one-off re expression to grab the recipe hrefs on this page
    links = re.findall(r'/wiki/Cookbook:(?!Recipes)(?!Table_of_Contents).*" ', content)
    global recipes
    recipes += links

    try:
        # again, a one-off re expression to find the "next 200" link
        next_url = re.findall(r'/w/index.*>next 200', content)[0].split('" ')[0]
        print "fetching next url: " + str(next_base + next_url)
        return get_links(next_base + next_url)
    except IndexError:
        print "all recipes fetched."
        print recipes
        return


if __name__ == '__main__':
    print "start fetching..."
    get_links(base_url)
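If you would rather keep BeautifulSoup, as in your fetch_links method, here is a minimal sketch of the same idea done iteratively: scrape the current page, then follow the "next 200" link until it no longer exists. It follows the Python 2 style of the sample above, leaves out your proxy handling, assumes the paging link text is literally "next 200", and the name fetch_all_recipe_links is just made up for the example:

#!/usr/bin/python
# minimal sketch: proxy handling omitted, fetch_all_recipe_links is a
# made-up name, and the paging link text is assumed to be exactly "next 200"
import re
import urllib
from bs4 import BeautifulSoup

base = 'http://en.wikibooks.org'
start = 'http://en.wikibooks.org/wiki/Category:Recipes'
recipe_hrefs = re.compile(r'^/wiki/Cookbook:(?!Recipes|Table_of_Contents)',
                          re.IGNORECASE)


def fetch_all_recipe_links(url):
    links = []
    while url:
        soup = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
        # collect the recipe hrefs on the current page
        links += [a['href'] for a in soup.find_all('a', href=recipe_hrefs)]
        # follow the paging link; stop once there is no "next 200" anymore
        next_link = soup.find('a', text=re.compile(r'^next 200'))
        url = base + next_link['href'] if next_link else None
    return links


if __name__ == '__main__':
    all_links = fetch_all_recipe_links(start)
    print "%d recipe links fetched." % len(all_links)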
Hope this gives you the technique you need.