Python: How to retrieve Google Scholar citations per year?

Asked: 2018-01-29 14:33:59

Tags: python parsing

I am trying to retrieve information from a Google Scholar profile. I have the URL:

# Python 2 code: urllib2 became urllib.request in Python 3.
from bs4 import SoupStrainer, BeautifulSoup
from urllib2 import Request, urlopen

url = "https://scholar.google.com/citations?user=qc6CJjYAAAAJ"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

I can get information such as h_index, i10_index, and citations with:

indexes = soup.find_all("td", "gsc_rsb_std")
h_index = indexes[2].string
i10_index = indexes[4].string
citations = indexes[0].string
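
The indexes 0, 2, and 4 work because the `gsc_rsb_std` cells appear to come in pairs: an all-time value followed by a recent-years value for each metric. A minimal sketch of labeling all six cells, assuming that row-major layout holds; the function name `label_profile_stats` and the sample numbers are illustrative only:

```python
def label_profile_stats(cell_texts):
    """Map the six stat cells to named metrics.

    Assumes row-major order: (all-time, recent) for citations,
    then h-index, then i10-index.
    """
    metrics = ["citations", "h_index", "i10_index"]
    if len(cell_texts) != 2 * len(metrics):
        raise ValueError("expected 6 cells, got %d" % len(cell_texts))
    values = [int(text) for text in cell_texts]
    return {
        name: {"all": values[2 * i], "recent": values[2 * i + 1]}
        for i, name in enumerate(metrics)
    }


# Illustrative cell texts, shaped like [td.string for td in indexes].
sample_cells = ["139782", "37906", "119", "66", "394", "219"]
stats = label_profile_stats(sample_cells)
print(stats["h_index"]["all"])
```

Raising on an unexpected cell count makes a layout change on the page fail loudly instead of silently shifting the values.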

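Now I would like to know how to get the total citations per year, as shown in the chart on the Google Scholar profile.

3 Answers:

Answer 0: (score: 2)

Use re.compile to match the three classes of table cells, which hold each publication's title, citation count, and year: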

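from bs4 import BeautifulSoup as soup
import urllib
import re
# Python 2 code: Python 3 uses urllib.request.urlopen instead of urllib.urlopen.
s = soup(str(urllib.urlopen('https://scholar.google.com/citations?user=qc6CJjYAAAAJ').read()), 'lxml')
# Collect the child texts of every title, cited-by, and year cell, in document order.
final_data = [[b.text for b in i] for i in s.find_all('td', {'class':re.compile('gsc_a_t|gsc_a_c|gsc_a_y')})]
# Each publication row contributes three consecutive cells.
grouped_data = [final_data[i:i+3] for i in range(0, len(final_data), 3)]
# Keep the first child text of each cell and label it.
citations = [dict(zip(['title', 'cited by', 'year'], map(lambda x:x[0], i))) for i in grouped_data]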

Output:

[{'year': u'1935', 'cited by': u'16471', 'title': u'Can quantum-mechanical description of physical reality be considered complete?'},
 {'year': u'1905', 'cited by': u'10925', 'title': u'Uber einen die Erzeugung und Verwandlung des Lichtes betreffenden heurischen Gesichtpunkt'},
 {'year': u'1905', 'cited by': u'9425', 'title': u'On the movement of small particles suspended in stationary liquids required by the molecular-kinetic theory of heat'},
 {'year': u'1956', 'cited by': u'4648', 'title': u'Investigations on the Theory of the Brownian Movement'},
 {'year': u'', 'cited by': u'4540', 'title': u'Zur Elektrodynamik bewegter K\xf6rper'},
 {'year': u'1911', 'cited by': u'4285', 'title': u'Graviton Mass and Inertia Mass'},
 {'year': u'1918', 'cited by': u'4196', 'title': u'On gravitational waves Sitzungsber. preuss'},
 {'year': u'1925', 'cited by': u'3947', 'title': u'Sitzungsber. K'},
 {'year': u'1917', 'cited by': u'3914', 'title': u'Sitzungsberichte der Preussischen Akad. d'},
 {'year': u'1906', 'cited by': u'3633', 'title': u'Eine neue bestimmung der molek\xfcldimensionen'},
 {'year': u'1950', 'cited by': u'3538', 'title': u'The meaning of relativity'},
 {'year': u'1998', 'cited by': u'3472', 'title': u'Ueber einen die Erzeugung und Verwandlung des Lichtes betreffenden heuristischen Gesichtspunkt'},
 {'year': u'1915', 'cited by': u'3065', 'title': u'Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin'},
 {'year': u'1954', 'cited by': u'2969', 'title': u'Evolution of Physics'},
 {'year': u'1920', 'cited by': u'2919', 'title': u'The special and general theory'},
 {'year': u'2006', 'cited by': u'2804', 'title': u'Die grundlage der allgemeinen relativit\xe4tstheorie'},
 {'year': u'1982', 'cited by': u'2643', 'title': u'The Science and the Life of Albert Einstein'},
 {'year': u'1917', 'cited by': u'2570', 'title': u'Zur quantentheorie der strahlung'},
 {'year': u'1954', 'cited by': u'2489', 'title': u'Physics and Reality, in \u201cIdeas and Opinions\u201d'},
 {'year': u'1924', 'cited by': u'2440', 'title': u'Quantum theory of monatomic ideal gases'}]
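
Note that the 'year' field in this list is the publication year, so summing 'cited by' per year gives the citations earned by papers published in that year, not the citations received during that year, which is what the profile chart shows. A small sketch of that aggregation, written for Python 3; the helper name `citations_by_publication_year` is hypothetical, and the sample reuses the first five entries of the output above with shortened titles:

```python
from collections import defaultdict


def citations_by_publication_year(publications):
    """Sum the 'cited by' counts of publications sharing a publication year.

    Entries with an empty or non-numeric year are grouped under None.
    """
    totals = defaultdict(int)
    for pub in publications:
        year = pub["year"].strip()
        key = int(year) if year.isdigit() else None
        # An empty 'cited by' string is treated as zero citations.
        totals[key] += int(pub["cited by"] or 0)
    return dict(totals)


sample = [
    {"year": "1935", "cited by": "16471", "title": "Can quantum-mechanical ..."},
    {"year": "1905", "cited by": "10925", "title": "Uber einen ..."},
    {"year": "1905", "cited by": "9425", "title": "On the movement ..."},
    {"year": "1956", "cited by": "4648", "title": "Investigations ..."},
    {"year": "", "cited by": "4540", "title": "Zur Elektrodynamik ..."},
]
by_year = citations_by_publication_year(sample)
print(by_year[1905])  # 10925 + 9425 = 20350
```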

Edit: to get the values from the chart, slightly change the arguments passed to find_all:

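from bs4 import BeautifulSoup as soup
import urllib
import re
# Python 2 code: Python 3 uses urllib.request.urlopen instead of urllib.urlopen.
s = soup(str(urllib.urlopen('https://scholar.google.com/citations?user=qc6CJjYAAAAJ').read()), 'lxml')
# Year labels along the chart's x-axis.
years = map(int, [i.text for i in s.find_all('span', {'class':'gsc_g_t'})])
# Citation counts attached to the chart's bars.
citation_number = map(int, [i.text for i in s.find_all('span', {'class':'gsc_g_al'})])
# Pair each year with its count.
final_chart_data = dict(zip(years, citation_number))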

Output:

{1979: 774, 1980: 649, 1981: 572, 1982: 722, 1983: 680, 1984: 725, 1985: 743, 1986: 664, 1987: 776, 1988: 792, 1989: 879, 1990: 924, 1991: 831, 1992: 1071, 1993: 1016, 1994: 1197, 1995: 1300, 1996: 1283, 1997: 1409, 1998: 1433, 1999: 1777, 2000: 1987, 2001: 2300, 2002: 2347, 2003: 2449, 2004: 2927, 2005: 4436, 2006: 4059, 2007: 4476, 2008: 4409, 2009: 4709, 2010: 4586, 2011: 5139, 2012: 5797, 2013: 6160, 2014: 5985, 2015: 6463, 2016: 6760, 2017: 6356, 2018: 396}
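
The same pairing can be sketched for Python 3 using only the standard library, which also makes it easy to try on a saved copy of the page. The class names `gsc_g_t` and `gsc_g_al` come from the answer above; the `SpanTextCollector` class, the `citations_per_year` function, and the sample markup are made up for illustration, and the real page markup may differ. Because `zip` pairs by position, the sketch raises an error when the number of years and counts differ (for example, if a year were rendered without a count), rather than returning shifted data:

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of each <span> carrying one of the wanted classes.

    Assumes the wanted spans contain plain text only (no nested spans).
    """

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.texts = {name: [] for name in self.wanted}
        self._current = None  # wanted class of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self._current = next((c for c in classes if c in self.wanted), None)
            if self._current:
                self.texts[self._current].append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.texts[self._current][-1] += data


def citations_per_year(html):
    """Return {year: citations} from chart markup, pairing spans by position."""
    parser = SpanTextCollector(["gsc_g_t", "gsc_g_al"])
    parser.feed(html)
    years = [int(text) for text in parser.texts["gsc_g_t"]]
    counts = [int(text) for text in parser.texts["gsc_g_al"]]
    if len(years) != len(counts):
        raise ValueError("%d years but %d counts" % (len(years), len(counts)))
    return dict(zip(years, counts))


# Made-up markup that only mimics the two class names used in the answer.
SAMPLE_HTML = """
<div>
  <span class="gsc_g_t">2016</span>
  <span class="gsc_g_t">2017</span>
  <span class="gsc_g_t">2018</span>
  <a class="gsc_g_a"><span class="gsc_g_al">6760</span></a>
  <a class="gsc_g_a"><span class="gsc_g_al">6356</span></a>
  <a class="gsc_g_a"><span class="gsc_g_al">396</span></a>
</div>
"""
chart = citations_per_year(SAMPLE_HTML)
print(chart)
```

To run it against a live profile, the decoded page source (for instance from `urllib.request.urlopen(url).read().decode("utf-8")`) would be passed to `citations_per_year`; automated requests may be blocked by the site.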

Answer 1: (score: 0)

You can also use the scholarly library:

import scholarly
author_name="..."
author = next(scholarly.search_author(author_name)).fill()
pubs = author.publications

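In addition, for more advanced visualization and text analysis of scholarly's output, you can check out a library of mine that uses scholarly: https://github.com/tyiannak/pyScholar

Answer 2: (score: 0)

You can use a third-party solution such as SerpApi to scrape the chart on the right-hand side of the user profile. It is a paid API with a free trial.

Example Python code (also available in other libraries):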

from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",
  "engine": "google_scholar_author",
  "author_id": "qc6CJjYAAAAJ",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

Example JSON output:

"cited_by": {
  "table": [
    {
      "citations": {
        "all": 139782,
        "since_2016": 37906
      }
    },
    {
      "h_index": {
        "all": 119,
        "since_2016": 66
      }
    },
    {
      "i10_index": {
        "all": 394,
        "since_2016": 219
      }
    }
  ],
  "graph": [
    {
      "year": 1982,
      "citations": 614
    },
    {
      "year": 1983,
      "citations": 680
    },
    {
      "year": 1984,
      "citations": 736
    },
    ...
  ]
}
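
Given a response shaped like the JSON above, the per-year totals can be pulled out of the `graph` list with a short helper. This is a sketch that assumes `results` is the dict returned by `search.get_dict()` and that it contains the `cited_by` and `graph` keys exactly as shown; the function name `graph_to_year_counts` is hypothetical:

```python
def graph_to_year_counts(results):
    """Turn results["cited_by"]["graph"] into a {year: citations} dict.

    Missing keys yield an empty dict instead of raising.
    """
    graph = results.get("cited_by", {}).get("graph", [])
    return {entry["year"]: entry["citations"] for entry in graph}


# The three graph entries shown in the example output above.
sample_results = {
    "cited_by": {
        "graph": [
            {"year": 1982, "citations": 614},
            {"year": 1983, "citations": 680},
            {"year": 1984, "citations": 736},
        ]
    }
}
per_year = graph_to_year_counts(sample_results)
print(per_year)
```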

You can check the documentation for more details.

Disclaimer: I work at SerpApi.