I am trying to retrieve information from a Google Scholar profile. I have the URL:
from bs4 import BeautifulSoup
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request instead
url = "https://scholar.google.com/citations?user=qc6CJjYAAAAJ"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
I can get values such as h_index, i10_index, and citations like this:
indexes = soup.find_all("td", "gsc_rsb_std")
h_index = indexes[2].string
i10_index = indexes[4].string
citations = indexes[0].string
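The snippet above reads positions 0, 2 and 4, which suggests the six `gsc_rsb_std` cells alternate between an all-time value and a recent-window value. A minimal sketch that gives those positions names, assuming that layout holds; the function name `label_metrics` and the key names are illustrative and not part of any library:

```python
def label_metrics(cells):
    """Map the six metric cells to named fields.

    Assumes the order: citations, h-index, i10-index, each as an
    (all-time, recent) pair. This layout is inferred from the indexes
    used in the question, not from any documented contract.
    """
    if len(cells) != 6:
        raise ValueError("expected 6 cells, got %d" % len(cells))
    names = ["citations", "h_index", "i10_index"]
    values = [int(c) for c in cells]
    return {
        name: {"all": values[2 * i], "recent": values[2 * i + 1]}
        for i, name in enumerate(names)
    }


# Sample strings standing in for [td.string for td in indexes]
sample = ["139782", "37906", "119", "66", "394", "219"]
metrics = label_metrics(sample)
print(metrics["h_index"]["all"])
```

Raising on an unexpected cell count makes a page-layout change fail loudly instead of silently returning the wrong metric.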
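Now I would like to know how to get the total number of citations per year, as shown in the Google Scholar chart.
Answer 0: (score: 2)
Use re.compile to find the table cells of three classes (title, cited-by count, and year):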
from bs4 import BeautifulSoup as soup
import urllib
import re
s = soup(str(urllib.urlopen('https://scholar.google.com/citations?user=qc6CJjYAAAAJ').read()), 'lxml')
final_data = [[b.text for b in i] for i in s.find_all('td', {'class':re.compile('gsc_a_t|gsc_a_c|gsc_a_y')})]
grouped_data = [final_data[i:i+3] for i in range(0, len(final_data), 3)]
citations = [dict(zip(['title', 'cited by', 'year'], map(lambda x:x[0], i))) for i in grouped_data]
Output:
[{'year': u'1935', 'cited by': u'16471', 'title': u'Can quantum-mechanical description of physical reality be considered complete?'},
 {'year': u'1905', 'cited by': u'10925', 'title': u'Uber einen die Erzeugung und Verwandlung des Lichtes betreffenden heurischen Gesichtpunkt'},
 {'year': u'1905', 'cited by': u'9425', 'title': u'On the movement of small particles suspended in stationary liquids required by the molecular-kinetic theory of heat'},
 {'year': u'1956', 'cited by': u'4648', 'title': u'Investigations on the Theory of the Brownian Movement'},
 {'year': u'', 'cited by': u'4540', 'title': u'Zur Elektrodynamik bewegter K\xf6rper'},
 {'year': u'1911', 'cited by': u'4285', 'title': u'Graviton Mass and Inertia Mass'},
 {'year': u'1918', 'cited by': u'4196', 'title': u'On gravitational waves Sitzungsber. preuss'},
 {'year': u'1925', 'cited by': u'3947', 'title': u'Sitzungsber. K'},
 {'year': u'1917', 'cited by': u'3914', 'title': u'Sitzungsberichte der Preussischen Akad. d'},
 {'year': u'1906', 'cited by': u'3633', 'title': u'Eine neue bestimmung der molek\xfcldimensionen'},
 {'year': u'1950', 'cited by': u'3538', 'title': u'The meaning of relativity'},
 {'year': u'1998', 'cited by': u'3472', 'title': u'Ueber einen die Erzeugung und Verwandlung des Lichtes betreffenden heuristischen Gesichtspunkt'},
 {'year': u'1915', 'cited by': u'3065', 'title': u'Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin'},
 {'year': u'1954', 'cited by': u'2969', 'title': u'Evolution of Physics'},
 {'year': u'1920', 'cited by': u'2919', 'title': u'The special and general theory'},
 {'year': u'2006', 'cited by': u'2804', 'title': u'Die grundlage der allgemeinen relativit\xe4tstheorie'},
 {'year': u'1982', 'cited by': u'2643', 'title': u'The Science and the Life of Albert Einstein'},
 {'year': u'1917', 'cited by': u'2570', 'title': u'Zur quantentheorie der strahlung'},
 {'year': u'1954', 'cited by': u'2489', 'title': u'Physics and Reality, in \u201cIdeas and Opinions\u201d'},
 {'year': u'1924', 'cited by': u'2440', 'title': u'Quantum theory of monatomic ideal gases'}]
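Note that every scraped value is a string, and one row in the output has an empty year. A small sketch of normalizing such rows before doing arithmetic on them; the helper name `normalize_row` is hypothetical, and mapping blank values to `None` is just one possible choice:

```python
def normalize_row(row):
    """Convert the 'cited by' and 'year' strings to int, or None if blank."""
    def to_int(text):
        text = text.strip()
        return int(text) if text.isdigit() else None

    return {
        "title": row["title"],
        "cited by": to_int(row["cited by"]),
        "year": to_int(row["year"]),
    }


# Two rows copied from the output, including the one with a blank year
rows = [
    {"year": "1935", "cited by": "16471",
     "title": "Can quantum-mechanical description of physical reality be considered complete?"},
    {"year": "", "cited by": "4540",
     "title": "Zur Elektrodynamik bewegter K\xf6rper"},
]
clean = [normalize_row(r) for r in rows]
total = sum(r["cited by"] for r in clean if r["cited by"] is not None)
print(total)
```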
Edit: to find the values in the chart, slightly change the arguments passed to find_all:
from bs4 import BeautifulSoup as soup
import urllib
import re
s = soup(str(urllib.urlopen('https://scholar.google.com/citations?user=qc6CJjYAAAAJ').read()), 'lxml')
years = map(int, [i.text for i in s.find_all('span', {'class':'gsc_g_t'})])
citation_number = map(int, [i.text for i in s.find_all('span', {'class':'gsc_g_al'})])
final_chart_data = dict(zip(years, citation_number))
Output:
{1979: 774, 1980: 649, 1981: 572, 1982: 722, 1983: 680, 1984: 725, 1985: 743, 1986: 664, 1987: 776, 1988: 792, 1989: 879, 1990: 924, 1991: 831, 1992: 1071, 1993: 1016, 1994: 1197, 1995: 1300, 1996: 1283, 1997: 1409, 1998: 1433, 1999: 1777, 2000: 1987, 2001: 2300, 2002: 2347, 2003: 2449, 2004: 2927, 2005: 4436, 2006: 4059, 2007: 4476, 2008: 4409, 2009: 4709, 2010: 4586, 2011: 5139, 2012: 5797, 2013: 6160, 2014: 5985, 2015: 6463, 2016: 6760, 2017: 6356, 2018: 396}
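On Python 3, `urllib.urlopen` no longer exists and `map` returns an iterator, so the snippet above needs adapting there. The pairing step itself can be tried offline: the sketch below uses only the standard library's `html.parser` on a hand-written HTML fragment. The class names `gsc_g_t` and `gsc_g_al` come from the answer; the fragment, the `SpanCollector` class, and the assumption that every year has a matching bar are illustrative.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of <span> elements, grouped by CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class of the span being read, if wanted
        self.found = {name: [] for name in wanted}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            hits = [c for c in classes if c in self.wanted]
            self.current = hits[0] if hits else None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found[self.current].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


# Hand-written fragment; the real page markup may differ.
fragment = """
<div>
  <span class="gsc_g_t">2016</span><span class="gsc_g_t">2017</span>
  <a><span class="gsc_g_al">6760</span></a><a><span class="gsc_g_al">6356</span></a>
</div>
"""
parser = SpanCollector(["gsc_g_t", "gsc_g_al"])
parser.feed(fragment)
years = [int(y) for y in parser.found["gsc_g_t"]]
counts = [int(c) for c in parser.found["gsc_g_al"]]
chart = dict(zip(years, counts))
print(chart)
```

Because `zip` stops at the shorter list and pairs purely by position, it is worth checking `len(years) == len(counts)` before trusting the result.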
Answer 1: (score: 0)
You can also use scholarly:
import scholarly
author_name="..."
author = next(scholarly.search_author(author_name)).fill()
pubs = author.publications
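Also, for more advanced visualization and text analysis of the scholarly output, you can check out my library, which uses scholarly: https://github.com/tyiannak/pyScholar
Answer 2: (score: 0)
You can use a third-party solution such as SerpApi to scrape the chart on the right-hand side of the user profile. It is a paid API with a free trial.
Example Python code (also available in other libraries):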
from serpapi import GoogleSearch
params = {
"api_key": "SECRET_API_KEY",
"engine": "google_scholar_author",
"author_id": "qc6CJjYAAAAJ",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"cited_by": {
  "table": [
    {
      "citations": {
        "all": 139782,
        "since_2016": 37906
      }
    },
    {
      "h_index": {
        "all": 119,
        "since_2016": 66
      }
    },
    {
      "i10_index": {
        "all": 394,
        "since_2016": 219
      }
    }
  ],
  "graph": [
    {
      "year": 1982,
      "citations": 614
    },
    {
      "year": 1983,
      "citations": 680
    },
    {
      "year": 1984,
      "citations": 736
    },
    ...
  ]
}
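Assuming `results` has the shape shown in the JSON output, the per-year series can be pulled out of the `graph` list under `cited_by` with a short comprehension. The sample dictionary below is a hand-copied excerpt of that output, so the sketch runs without an API key:

```python
# Excerpt of the response shape shown in the answer (not a live API result)
results = {
    "cited_by": {
        "graph": [
            {"year": 1982, "citations": 614},
            {"year": 1983, "citations": 680},
            {"year": 1984, "citations": 736},
        ]
    }
}

# Fall back to an empty list if either key is missing
graph = results.get("cited_by", {}).get("graph", [])
citations_per_year = {point["year"]: point["citations"] for point in graph}
print(citations_per_year)
print(sum(citations_per_year.values()))
```

Using `.get` with defaults keeps the sketch from raising `KeyError` if a profile has no chart data.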
You can check the documentation for more details.
Disclaimer: I work at SerpApi.