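I wrote a web scraper to extract information from Google Scholar. However, every convenient tool (such as urllib2 or requests) fails: it gives me a 503 error code.
I am looking for another way to extract the information. Perhaps I could have the program open the URL in a browser instead of fetching it directly.
For example, here is a link:
'http://scholar.google.com/citations?user=lTCxlGYAAAAJ&hl=en'
How can I go on to get the h-index and so on?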
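Answer 0 (score: 0)
Google Scholar appears to temporarily ban clients (with a 503 error code) whose queries are too frequent or look automated. You may have been banned after querying too often, or because it decided you were running from a script. You can use cookies to run multiple queries within a single session. Alternatively, wait until the ban is lifted, wait between attempts, or make the script present itself as a web browser (by changing the 'userAgent' string it sends with the query).
Search Google for "google scholar 503" to find plenty of information on this topic (that is all I did).
See also this thread: 503 error when trying to access Google Patents using python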
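The suggestions in this answer (one session that keeps cookies, a browser-like User-Agent, and pauses between attempts) can be sketched roughly as follows. The helper names `fetch_with_backoff` and `build_session`, the retry count, and the delays are illustrative assumptions, not values Google documents, and none of this guarantees the ban will be avoided.

```python
import time


def fetch_with_backoff(session, url, max_tries=4, base_delay=5.0, sleep=time.sleep):
    """Fetch url, pausing with a doubling delay whenever the server answers 503.

    `session` is any object with a requests-style get() method.
    max_tries and base_delay are arbitrary illustrative values.
    """
    response = None
    for attempt in range(max_tries):
        response = session.get(url, timeout=30)
        if response.status_code != 503:
            return response
        # Wait before the next attempt, but not after the final one
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)
    return response


def build_session(user_agent):
    """Create a requests.Session that keeps cookies and sends a fixed User-Agent."""
    import requests  # third-party; imported here so the helper above stays dependency-free

    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session
```

A call would look like `fetch_with_backoff(build_session("<a browser User-Agent string>"), url)`. Reusing the same session for later queries keeps whatever cookies the site set on the first response.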
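Answer 1 (score: 0)
You are probably getting a 503 response code because Google detected that your script sends automated requests. You can always print the response status code and text to see what happened. It could be a limit on the number of requests per X amount of time, or something else.
The first thing you can try in order to avoid this is to use a proxy.
Code to scrape the whole Cited by table (including the graph), which you can also test in the online IDE (bs4 folder -> get_citedby_public_access.py):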
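As a minimal sketch of the "print the response to see what happened" advice: the hypothetical helper `describe_response` below summarises a requests-style response object. Whether the server sends a `Retry-After` header at all is an assumption; the helper simply reports it when present.

```python
def describe_response(response):
    """Summarise an HTTP response: status, reason, Retry-After and a body snippet.

    `response` is expected to look like requests.Response
    (status_code, reason, headers, text).
    """
    # Retry-After is optional; report a placeholder when the server did not send it
    retry_after = response.headers.get("Retry-After", "not provided")
    # The first 200 characters of the body are usually enough to spot a block page
    snippet = response.text[:200].replace("\n", " ")
    return (
        f"status={response.status_code} reason={response.reason} "
        f"retry_after={retry_after} body={snippet!r}"
    )
```

Printing `describe_response(requests.get(url, headers=headers))` before parsing makes it obvious whether the HTML you are about to feed to the parser is the profile page or an error page.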
from bs4 import BeautifulSoup
import requests, lxml, os, json

# Browser-like User-Agent so the request does not look like a default script client
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Proxy URLs come from environment variables; the target URL is https,
# so an 'https' entry is included alongside 'http'
proxies = {
    'http': os.getenv('HTTP_PROXY'),
    'https': os.getenv('HTTPS_PROXY')
}

html = requests.get('https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# Cited by and public access results
for cited_by_public_access in soup.select('.gsc_rsb'):
    citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
    citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
    h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
    h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
    i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
    i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
    articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
    articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']

    print('Citation info:')
    print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')

# Graph results: year labels and citation counts are read as two parallel lists
years = [graph_year.text for graph_year in soup.select('.gsc_g_t')]
citations = [graph_citation.text for graph_citation in soup.select('.gsc_g_a')]

data = []

for year, citation in zip(years, citations):
    # Basic prints
    print(f'{year} {citation}\n')

    data.append({
        'year': year,
        'citation': citation,
    })

# JSON output, if needed
print(json.dumps(data, indent=2))
Partial output:
Citation info:
3208
2184
21
21
28
23
2
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=8Cuk5vYAAAAJ
# Portion of the regular output
2007 24
2008 30
2009 46
# Portion of JSON
[
  {
    "year": "2007",
    "citation": "24"
  },
  {
    "year": "2008",
    "citation": "30"
  }
]
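The graph rows above are all strings, so any arithmetic on them needs a conversion step first. A minimal sketch, assuming rows shaped like the JSON output shown (the helper name `to_yearly_counts` and the comma stripping are illustrative additions, not part of the original script):

```python
def to_yearly_counts(rows):
    """Convert [{'year': '2007', 'citation': '24'}, ...] into {2007: 24, ...}."""
    counts = {}
    for row in rows:
        year_text = row["year"].strip()
        # Drop thousands separators in case a count is rendered like "1,234"
        citation_text = row["citation"].strip().replace(",", "")
        if not year_text or not citation_text:
            continue  # skip rows with an empty label or value
        counts[int(year_text)] = int(citation_text)
    return counts


# Sample rows copied from the partial output above
sample = [
    {"year": "2007", "citation": "24"},
    {"year": "2008", "citation": "30"},
    {"year": "2009", "citation": "46"},
]
counts = to_yearly_counts(sample)
print(counts)                # {2007: 24, 2008: 30, 2009: 46}
print(sum(counts.values()))  # 100
```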
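Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It is a paid API with a free trial of 5,000 searches.
It does the same thing as the code above, except that you do not have to work around blocking or maintain the parser yourself.
Code to integrate: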
from serpapi import GoogleSearch
import os

# The API key is read from an environment variable
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "m8dFEawAAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

# Cited By and public access results
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}\n')

# Graph results
for graph_results in results['cited_by']['graph']:
    year = graph_results['year']
    citations = graph_results['citations']

    print(f'{year} {citations}\n')
Partial output:
946
563
17
12
27
18
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=m8dFEawAAAAJ
23
2004 6
2005 20
2006 11
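The lookups above depend on the position of each row in `results['cited_by']['table']`. If you would rather look metrics up by name, one possible sketch is to merge the rows into a single dictionary. The helper name `flatten_cited_by_table` is hypothetical, and `sample_table` is a hand-written stand-in that only mirrors the shape the code above indexes into, filled with the numbers from the partial output:

```python
def flatten_cited_by_table(table):
    """Merge a list of single-key rows into one dict keyed by metric name."""
    metrics = {}
    for row in table:
        # Each row is assumed to look like {'h_index': {'all': ..., 'since_2016': ...}}
        metrics.update(row)
    return metrics


# Hand-written sample mirroring results['cited_by']['table']
sample_table = [
    {"citations": {"all": 946, "since_2016": 563}},
    {"h_index": {"all": 17, "since_2016": 12}},
    {"i10_index": {"all": 27, "since_2016": 18}},
]

metrics = flatten_cited_by_table(sample_table)
print(metrics["h_index"]["all"])          # 17
print(metrics["citations"]["since_2016"])  # 563
```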
Disclaimer: I work for SerpApi.