I'm trying to scrape Citations, h-index, and i10-index from a Google Scholar page and store them in a Pandas DataFrame using the selenium webdriver. Below is my webdriver code.
# install chromium, its driver, and selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
wd = webdriver.Chrome('chromedriver', options=options)
wd.get("https://scholar.google.com/citations?user=kukA0LcAAAAJ&hl=en&oi=ao")
divs = wd.find_elements_by_class_name('gsc_rsb'
for i in divs[0].find_elements_by_tag_name('a'):
    #print(i)
    print(i.get_attribute('text'))
The result is as follows:
Get my own profile
Citations
h-index
i10-index
1051
1163
1365
1762
2001
2707
4192
7293
13372
25177
40711
65915
87193
101992
21846
View all
Aaron Courville
Pascal Vincent
Kyunghyun Cho
Ian Goodfellow
Yann LeCun
Hugo Larochelle
Caglar Gulcehre
Dzmitry Bahdanau
David Warde-Farley
Xavier Glorot
Razvan Pascanu
Leon Bottou
Sherjil Ozair
Mehdi Mirza
James Bergstra
Olivier Delalleau
Anirudh Goyal
Pascal Lamblin
Patrick Haffner
Nicolas Le Roux
But I only need Citations, h-index, and i10-index, in a Pandas DataFrame like this:
| Name          | Citations(All) | Citations(since2016) | i10-index | i10-index(since2016) |
+---------------+----------------+----------------------+-----------+----------------------+
| Yoshua Bengio | 387118         | 343301               | 181       | 164                  |
How can I achieve this with the code above?
Answer 0 (score: 0):
divs = wd.find_elements_by_class_name('gsc_rsb' #no closing bracket
You can quickly find the right CSS selectors with SelectorGadget, then extract the data with BeautifulSoup's select() / select_one() methods.
Note that even when using selenium or requests-html, Google may still throw a CAPTCHA. If custom headers don't help, the first thing to try is adding proxies to your requests.
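As a minimal sketch of that advice: the user-agent string and proxy address below are placeholders I've made up, not tested values. You'd pass them to `requests.get` (the actual request is left commented out so nothing hits the network):

```python
import requests

# A browser-like user-agent; the exact string is just an example.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

# Placeholder proxy address -- replace with a proxy you actually control.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

url = "https://scholar.google.com/citations?user=kukA0LcAAAAJ"
# response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
# html = response.text
```

Rotating proxies (or a residential proxy pool) tends to matter more than the exact header values once CAPTCHAs start appearing.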
Here's what I came up with (it also works for other profiles):
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
browser = webdriver.Chrome()
browser.get('https://scholar.google.com/citations?user=kukA0LcAAAAJ')
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
name = soup.select_one('#gsc_prf_in').text
citations_all = soup.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
citations_since_2016 = soup.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
h_index_all = soup.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
h_index_since_2016 = soup.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
i_10_index_all = soup.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
i_10_index_since_2016 = soup.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
data = {
    "Name": [name],
    "Citations": citations_all,
    "Citations Since 2016": citations_since_2016,
    "h-index": h_index_all,
    "h-index Since 2016": h_index_since_2016,
    "i10-index": i_10_index_all,
    "i10-index Since 2016": i_10_index_since_2016,
}
df = pd.DataFrame(data)
print(df)
Output:
Name Citations Citations 2016 h-index h-index 2016 i10-index \
0 Yoshua Bengio 387118 343301 181 164 625
i10-index 2016
0 55
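Instead of six positional selectors, you could also grab all of the `.gsc_rsb_std` cells at once and unpack them, since they appear in row order (Citations all/since, h-index all/since, i10-index all/since). A sketch on a mock of the stats table -- the HTML below is a simplified stand-in I wrote for illustration, not the real page markup:

```python
from bs4 import BeautifulSoup

# Simplified mock of the Scholar stats table (assumption: the real table
# renders six .gsc_rsb_std cells in this row order).
mock_html = """
<table id="gsc_rsb_st"><tbody>
  <tr><td class="gsc_rsb_sc1">Citations</td>
      <td class="gsc_rsb_std">387118</td><td class="gsc_rsb_std">343301</td></tr>
  <tr><td class="gsc_rsb_sc1">h-index</td>
      <td class="gsc_rsb_std">181</td><td class="gsc_rsb_std">164</td></tr>
  <tr><td class="gsc_rsb_sc1">i10-index</td>
      <td class="gsc_rsb_std">625</td><td class="gsc_rsb_std">552</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(mock_html, "html.parser")
cells = [td.text for td in soup.select("#gsc_rsb_st .gsc_rsb_std")]
(citations_all, citations_since,
 h_all, h_since,
 i10_all, i10_since) = cells
```

This is less brittle than `nth-child` chains if Google reorders sibling elements, though it still breaks if rows are added or removed.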
An alternative solution uses requests-html (docs).
Code:
from requests_html import HTMLSession
import pandas as pd
session = HTMLSession()
url = 'https://scholar.google.com/citations?user=kukA0LcAAAAJ'
r = session.get(url)
r.html.render()
name = r.html.find('#gsc_prf_in', first = True).text
citations_all = r.html.find('.gsc_rsb_std', first = True).text
citations_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(1) > td:nth-child(3)', first = True).text
h_index = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(2) > td:nth-child(2)', first = True).text
h_index_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(2) > td:nth-child(3)', first = True).text
i10_index = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(3) > td:nth-child(2)', first = True).text
i10_index_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(3) > td:nth-child(3)', first = True).text
data = {
    "Name": [name],
    "Citations": citations_all,
    "Citations 2016": citations_2016,
    "h-index": h_index,
    "h-index 2016": h_index_2016,
    "i10-index": i10_index,
    "i10-index 2016": i10_index_2016,
}
df = pd.DataFrame(data)
print(df)
Output (I guess PyCharm doesn't like long names in the console, so it replaces them with ..., but the values are there):
Name Citations Citations 2016 ... h-index 2016 i10-index i10-index 2016
0 Yoshua Bengio 388599 344787 ... 164 627 552
Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the playground to test it.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']
print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')
public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']
print(f'{public_access_link}\n{public_access_available_articles}\n')
# Output:
'''
67595
28238
110
63
966
448
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=9PepYk8AAAAJ
7
'''
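To get the same DataFrame shape the question asks for, the SerpApi response can be reshaped with pandas. The `results` dict below is a mock that mirrors the `cited_by.table` structure used above; the numbers come from the sample output, not a live API call, and the `author`/`name` field is my assumption about where the profile name lives:

```python
import pandas as pd

# Mock of the relevant slice of a google_scholar_author response;
# values are the sample output shown above, not a live request.
results = {
    "author": {"name": "Example Author"},  # assumed location of the name
    "cited_by": {"table": [
        {"citations": {"all": 67595, "since_2016": 28238}},
        {"h_index": {"all": 110, "since_2016": 63}},
        {"i10_index": {"all": 966, "since_2016": 448}},
    ]},
}

table = results["cited_by"]["table"]
df = pd.DataFrame({
    "Name": [results["author"]["name"]],
    "Citations": [table[0]["citations"]["all"]],
    "Citations Since 2016": [table[0]["citations"]["since_2016"]],
    "h-index": [table[1]["h_index"]["all"]],
    "h-index Since 2016": [table[1]["h_index"]["since_2016"]],
    "i10-index": [table[2]["i10_index"]["all"]],
    "i10-index Since 2016": [table[2]["i10_index"]["since_2016"]],
})
print(df)
```

Swapping in the real `results` from `search.get_dict()` above should produce the one-row DataFrame the question describes.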
> Disclaimer: I work for SerpApi.