How to open a URL and extract information in Python

Date: 2016-11-08 21:52:48

Tags: python scrape

I wrote a web scraper to extract information from Google Scholar. However, every convenient tool (such as urllib2 or requests) fails: it gives me a 503 error code.

I am looking for another way to extract the information. Perhaps I could have the program open the URL in a browser instead and extract the information from there.

For example, here is a link:

'http://scholar.google.com/citations?user=lTCxlGYAAAAJ&hl=en'

How do I go about getting the h-index and so on?

2 answers:

Answer 0 (score: 0)

Google Scholar appears to temporarily ban clients (with a 503 error code) that query it too frequently or that look automated. You may have been banned for querying too often, or because it decided you are running from a script. You could use cookies to perform multiple queries within a single session, wait until the ban is lifted, wait between attempts, or make your script look like it comes from a web browser (change the 'User-Agent' string it sends with its queries).
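For illustration, here is a minimal sketch of the approach described above (a browser-like User-Agent, one cookie-keeping session, and a pause between attempts); the header string, URL, and delay are arbitrary examples, not values from this answer:

import time
import requests

# Pretend to be a regular browser and keep cookies in one session so
# repeated queries look less like an automated script.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/70.0.3538.102 Safari/537.36')
}

url = 'http://scholar.google.com/citations?user=lTCxlGYAAAAJ&hl=en'

with requests.Session() as session:
    for attempt in range(3):
        response = session.get(url, headers=headers)
        if response.status_code == 200:
            html = response.text
            break
        # Still blocked (e.g. 503): wait before trying again.
        time.sleep(60)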

Do a Google search for "google scholar 503" to find plenty of information on this topic (that is all I did).

See also this thread: 503 error when trying to access Google Patents using python

Answer 1 (score: 0)

You are probably receiving a 503 response code because Google has detected that your script is sending automated requests. You can always print the response code and text to see what is going on. It could be a limit on the number of requests per X amount of time, or something else.
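For example, a quick check of what the server is actually sending back (a small sketch using the requests library):

import requests

response = requests.get('https://scholar.google.com/citations?user=lTCxlGYAAAAJ&hl=en')
print(response.status_code)   # e.g. 503 when Google is blocking the request
print(response.text[:500])    # the body usually hints at why (rate limit, CAPTCHA page, ...)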

The first thing you can try in order to avoid this is to use proxies.

The full code to scrape the table (including the graph), which you can also test in the online IDE (bs4 folder -> get_citedby_public_access.py):

from bs4 import BeautifulSoup
import requests, lxml, os, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# Cited by and public access results
for cited_by_public_access in soup.select('.gsc_rsb'):
  citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
  citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
  h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
  h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
  i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
  i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
  articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
  articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']
  
  print('Citation info:')
  print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')

# Graph results
years = [graph_year.text for graph_year in soup.select('.gsc_g_t')]
citations = [graph_citation.text for graph_citation in soup.select('.gsc_g_a')]

data = []

for year, citation in zip(years,citations):
  # Basic prints
  print(f'{year} {citation}\n')

  data.append({
    'year': year,
    'citation': citation,
  })

# JSON output, if needed
print(json.dumps(data, indent=2))

Part of the output:

Citation info:
3208
2184
21
21
28
23
2
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=8Cuk5vYAAAAJ

# Portion of the regular output
2007 24

2008 30

2009 46

# Portion of JSON
[
  {
    "year": "2007",
    "citation": "24"
  },
  {
    "year": "2008",
    "citation": "30"
  }
]

Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It is a paid API with a free trial of 5,000 searches.

It does the same thing as the code above, except that you do not have to figure out how to avoid being blocked or maintain the parser over time.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "m8dFEawAAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

# Cited By and public access results
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}\n')

# Graph results
for graph_results in results['cited_by']['graph']:
  year = graph_results['year']
  citations = graph_results['citations']

  print(f'{year} {citations}\n')

Part of the output:

946
563
17
12
27
18

https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=m8dFEawAAAAJ
23

2004 6

2005 20

2006 11

Disclaimer: I work for SerpApi.