I am trying to scrape the PDF links from Google Scholar search results. I tried to set up a page counter based on the change in the URL, but after the first eight output links I just get the same links repeated as output.
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests

#modifying the url as per page
urlCounter = 0
while urlCounter <= 30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url, None, {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <= 9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount + 1

        #printing the links
        for link in soup.find_all('div', id=finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
            for link in soup1.find_all('a'):
                print(link.get('href'))
Answer (score: 1)
Change the following line in your code:
finRecord = recordPart1 + str(recordCount)
to
finRecord = recordPart1 + str(recordCount+urlCounter-10)
The real problem: the div IDs on the first page are gs_ggsW[0-9], but on the second page the IDs are gs_ggsW[10-19]. So BeautifulSoup finds no matching links on the second page.
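For context, here is how that corrected offset sits in the inner loop; a minimal sketch only, assuming the request/parsing code from the question is unchanged and urlCounter has already been advanced by 10 for the page just fetched:

recordCount = 0
while recordCount <= 9:
    # urlCounter already holds the *next* page's offset,
    # so subtract 10 to get the offset of the page held in `soup`
    finRecord = "gs_ggsW" + str(recordCount + urlCounter - 10)
    recordCount = recordCount + 1
    for div in soup.find_all('div', id=finRecord):
        for a in div.find_all('a'):
            print(a.get('href'))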
Python's variable scoping can confuse people coming from other languages such as Java. After the for loop below has finished executing, the variable link still exists, so link keeps referring to the last link on the first page.
for link in soup1.find_all('a'):
    print(link.get('href'))
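A tiny standalone example (not from the question) showing that behavior:

for link in ['a.pdf', 'b.pdf', 'c.pdf']:
    pass
# the loop variable is still bound after the loop finishes
print(link)  # prints 'c.pdf'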
Update
Google may not provide a PDF download link for some papers, so you cannot rely on the div ID to match a link to each paper. You can use a CSS selector to match all of the links at once:
soup = BeautifulSoup(html)
urlCounter = urlCounter + 10
for link in soup.select('div.gs_ttss a'):
    print(link.get('href'))
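Putting it together, a minimal self-contained sketch using requests and bs4 (both are already imported in the question). The div.gs_ttss selector is taken from the snippet above; Google Scholar's markup can change and it rate-limits scrapers, so treat this as illustrative only:

import requests
from bs4 import BeautifulSoup

base = "http://scholar.google.com/scholar?start={}&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}

for start in range(0, 40, 10):  # four pages of ten results each
    resp = requests.get(base.format(start), headers=headers)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # grab every full-text link in one pass, regardless of the div's id
    for link in soup.select('div.gs_ttss a'):
        print(link.get('href'))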