Following the first 5 pages of Google results with Python Mechanize

Date: 2013-08-11 01:26:01

Tags: python conditional mechanize mechanize-python

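I am currently scraping the first page of Google search results, but I want to scrape the first 5 pages.

I start from this URL string: https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=0

The variable urls holds all 10 results from the first page. I have started adding a condition that checks whether the first page contains 10 URLs. If it does, I want to continue to the next page (and keep going as long as that page also has 10 results), using follow_link() with the following URLs: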

https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=10
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=20
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=30
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=40
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=50

How can I do this? Can someone help me?
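
The URLs listed above differ only in the start parameter, so one option is to generate them directly instead of following links. The snippet below is a minimal sketch, assuming 10 results per page; page_urls, pages and per_page are hypothetical names introduced for illustration. Under that assumption the first five pages correspond to start=0 through start=40.

```python
from urllib.parse import urlencode

BASE = "https://encrypted.google.com/search"


def page_urls(query, pages=5, per_page=10):
    """Return one search URL per results page, differing only in `start`."""
    return [
        BASE + "?" + urlencode({"hl": "en", "q": query, "start": page * per_page})
        for page in range(pages)
    ]


urls = page_urls("site:somedomain.com")
for u in urls:
    print(u)
```

Each generated URL can then be passed to br.open() in turn, stopping early once a page returns fewer than 10 results.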

1 answer:

Answer 0: (score: 2)

You can use BeautifulSoup to locate the element that holds the link to the next page:

from mechanize import Browser
from bs4 import BeautifulSoup

br = Browser()
br.set_handle_robots(False)
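# Adjacent string literals are concatenated, which keeps the User-agent
# value free of stray indentation whitespace.
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (Windows NT 6.2; WOW64) '
                  'AppleWebKit/537.11 (KHTML, like Gecko) '
                  'Chrome/23.0.1271.97 Safari/537.11')]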

url = "https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=0"

r = br.open(url)

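# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(r, "html.parser")

# The "Next" link carries id="pnnext"; find() returns None when it is absent.
nextpage = soup.find("a", {"id": "pnnext"})
if nextpage is not None:
    print(nextpage['href'])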

Output:

/search?q=site:somedomain.com&hl=en&ei=NJ4HUo2yM-TK4ATJlYGICQ&start=10&sa=N

So now you have the link to the next page. If the element cannot be found, you are on the last page.
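
Repeating that step in a loop gives the first 5 pages. The following is a rough sketch rather than a tested scraper: it assumes the next-page link still carries id="pnnext" as in the answer, it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, and the names NextLinkFinder, find_next_href, crawl_pages and fetch are hypothetical. The demo feeds it canned HTML from a made-up example.com site instead of making a live request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a id="pnnext"> element, if any."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == "pnnext" and self.href is None:
            self.href = attrs.get("href")


def find_next_href(html):
    """Return the next-page href found in html, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.href


def crawl_pages(fetch, first_url, max_pages=5):
    """Fetch up to max_pages result pages by following the 'pnnext' link."""
    pages = []
    url = first_url
    while url is not None and len(pages) < max_pages:
        html = fetch(url)
        pages.append((url, html))
        href = find_next_href(html)
        # The href is relative (e.g. "/search?q=..."), so resolve it
        # against the URL of the page it came from.
        url = urljoin(url, href) if href else None
    return pages


# Demo with canned HTML in place of real responses (hypothetical data).
fake_site = {
    "https://example.com/search?start=0":
        '<a id="pnnext" href="/search?start=10">Next</a>',
    "https://example.com/search?start=10": "<p>last page</p>",
}
result = crawl_pages(fake_site.__getitem__, "https://example.com/search?start=0")
print([url for url, _ in result])
```

With mechanize, fetch would be a small function that wraps br.open(url).read() and decodes the bytes to text, and find_next_href could be swapped for the soup.find("a", {"id": "pnnext"}) lookup shown above. The loop stops either when max_pages is reached or when no next link is found.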