I am writing a scraper to extract recipe href links from the Wikibooks Cookbook. Given my implementation below, how do I keep appending links until I reach the last page? Note: the next-page link is titled "next 200".
The links are listed here: http://en.wikibooks.org/wiki/Category:Recipes
def fetch_links(self, proxy):
    """Extracts filtered recipe href links.

    Args:
        proxy: The configured proxy address.

    Raises:
        ValueError: If proxy is not a valid address.
    """
    if not self._valid_proxy(proxy):
        raise ValueError('invalid proxy address: {}'.format(proxy))
    self.browser.set_proxies({'http': proxy})
    page = self.browser.open(self.wiki_recipes)
    html = page.read()
    link_tags = SoupStrainer('a', href=True)
    soup = BeautifulSoup(html, parse_only=link_tags)
    recipe_hrefs = r'^\/wiki\/Cookbook:(?!recipes|table_of_contents).*$'
    return [link['href'] for link in soup.find_all(
        'a', href=re.compile(recipe_hrefs, re.IGNORECASE))]
Answer 0 (score: 0)
Following the approach I suggested in the comments, here is a code sample using urllib and re; the same technique can be reused in your own code.
Create a function that takes a URL as its argument. Pass in the start URL first, scrape all the recipe links with a regex, and append them to a global list. Then take the "next 200" link as the argument and call the same function again. A try/except catches the point where there is no next link left, and the list is printed out.
Since your class code is not shown, I will skip all the class and proxy parts:
#!/usr/bin/python
import urllib
import re

base_url = 'http://en.wikibooks.org/wiki/Category:Recipes'
next_base = 'http://en.wikibooks.org'

recipes = []


# this is just the sample function
# you should handle your proxy logic here too
def get_links(url):
    request = urllib.urlopen(url)
    content = request.read()

    # one-off re expression to grab the recipe hrefs on this page
    links = re.findall(r'/wiki/Cookbook:(?!Recipes)(?!Table_of_Contents).*" ', content)
    global recipes
    recipes += links

    try:
        # again, a one-off re expression to find the "next 200" link
        next_url = re.findall(r'/w/index.*>next 200', content)[0].split('" ')[0]
        print "fetching next url: " + str(next_base + next_url)
        return get_links(next_base + next_url)
    except IndexError:
        print "all recipes fetched."
        print recipes
        return


if __name__ == '__main__':
    print "start fetching..."
    get_links(base_url)
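If you would rather keep BeautifulSoup, as in your fetch_links method, here is a minimal sketch of the same idea done iteratively: scrape the current page, then follow the "next 200" link until it no longer exists. It follows the Python 2 style of the sample above, leaves out your proxy handling, assumes the paging link text is literally "next 200", and the name fetch_all_recipe_links is just made up for the example:

#!/usr/bin/python
# minimal sketch: proxy handling omitted, fetch_all_recipe_links is a
# made-up name, and the paging link text is assumed to be exactly "next 200"
import re
import urllib
from bs4 import BeautifulSoup

base = 'http://en.wikibooks.org'
start = 'http://en.wikibooks.org/wiki/Category:Recipes'
recipe_hrefs = re.compile(r'^/wiki/Cookbook:(?!Recipes|Table_of_Contents)',
                          re.IGNORECASE)


def fetch_all_recipe_links(url):
    links = []
    while url:
        soup = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
        # collect the recipe hrefs on the current page
        links += [a['href'] for a in soup.find_all('a', href=recipe_hrefs)]
        # follow the paging link; stop once there is no "next 200" anymore
        next_link = soup.find('a', text=re.compile(r'^next 200'))
        url = base + next_link['href'] if next_link else None
    return links


if __name__ == '__main__':
    all_links = fetch_all_recipe_links(start)
    print "%d recipe links fetched." % len(all_links)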
Hope this gives you the technique you need.