Question

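I am trying to access a website called scopus.com. What I want to do is search for an author there and get their number of publications, h-index, and so on. The site is only accessible from the university's wireless network (whenever I want to access it from home, I use a VPN).

Here is the code: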

import urllib

first_name = "John"
last_name = "Smith"

new_url = "http://www.scopus.com/results/authorNamesList.url?sort=\
count-f&src=al&sid=66892931B99391BF99AFADC3006D1357.WXhD7YyTQ6A7Pvk9AlA%3a50\
&sot=al&sdt=al&sl=47&s=AUTH--LAST--NAME%28" + last_name + \
"%29+AND+AUTH--FIRST%28" + first_name + "%29&st1=" + last_name + "&st2=" + first_name +\
"&orcidId=&selectionPageSearch=anl&reselectAuthor=false&activeFlag=false&showDocument=\
false&resultsPerPage=20&offset=1&jtp=false&currentPage=1&previousSelectionCount=\
0&tooManySelections=false&previousResultCount=0&authSubject=LFSC&authSubject=\
HLSC&authSubject=PHSC&authSubject=SOSC&exactAuthorSearch=false&showFullList=\
false&authorPreferredName=&origin=searchauthorlookup&affiliationId=&txGid=\
66892931B99391BF99AFADC3006D1357.WXhD7YyTQ6A7Pvk9AlA%3a5"

page_source = urllib.urlopen(new_url).read()

print page_source

No matter what I do, I always get this error:

File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 386, in http_error_default
raise IOError, ('http error', errcode, errmsg, headers)

IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x102c85a28>)

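I have spent some time on this forum and I think I have tried everything I could find (including pretending to access the site as Opera). Is there any way I can do this, or should I give up and do it manually 700 times?

Thanks in advance for your help.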

Answer 1

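This has nothing to do with your VPN. The main problem is that you are trying to fetch a page that requires a valid session (one that is established during the browser's request-response exchange). Your options:

Use Mechanize
Use Requests

In any case, though, I would encourage you to use the API for this kind of problem: the Elsevier API.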
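
To illustrate what "a valid session" means in practice, here is a minimal Python 3 sketch. It deliberately swaps in the standard library (http.cookiejar plus urllib.request) for Mechanize or Requests so that it has no third-party dependency; the idea is the same, namely one client object that stores the cookies the server sets and sends them back on later requests. The helper names and the example.com address are hypothetical, and only the st1/st2 parameters are taken from the URL in the question. Whether cookies alone are enough for Scopus is not something this sketch can confirm; the site may still require a login.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_search_url(base_url, first_name, last_name):
    """Encode the search parameters instead of concatenating strings by hand."""
    # st1 / st2 are the last-name and first-name fields seen in the question's URL.
    params = {"st1": last_name, "st2": first_name}
    return base_url + "?" + urllib.parse.urlencode(params)


opener, jar = build_session_opener()

# Hypothetical flow (not executed here): open an entry page first so the server
# can set its session cookies, then request the search page with the SAME opener:
#   opener.open("http://www.example.com/")
#   page_source = opener.open(search_url).read()
search_url = build_search_url(
    "http://www.example.com/results/authorNamesList.url", "John", "Smith"
)
print(search_url)
```

Note that the sid and txGid values hard-coded in the question's URL look like identifiers for one particular browser session, which is presumably why copying them into a script does not carry the session over.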

Answer 2

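Quite simply, a 401 error means you are not authorized (usually, you must be logged in to access the site). That said, what you are doing is expressly prohibited by their robots.txt file, so I suggest you do not persist.

Having said that, if you remain interested in scraping other websites, I would suggest looking at the Python Requests module as well as Beautiful Soup.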
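
As a rough way to check a robots.txt file before scraping, Python 3's standard urllib.robotparser can evaluate its rules. The sketch below parses a made-up rule set for example.com, not the real Scopus file, and the helper name is_allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; they are NOT copied from any real site.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /results/",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


blocked = "http://www.example.com/results/authorNamesList.url"
print(is_allowed(EXAMPLE_RULES, "my-script", blocked))
```

Against a live site you would instead call set_url() with the address of its robots.txt and then read() to download the current rules before calling can_fetch().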

Python urllib.urlopen IOError when using a VPN
