Web scraping Google Scholar with Selenium

Date: 2021-04-04 12:52:13

Tags: python-3.x selenium selenium-webdriver

I am trying to scrape Citations, h-index, and i10-index from a Google Scholar profile page and store them in a pandas DataFrame, using Selenium WebDriver. Below is my WebDriver code.

# install chromium, its driver, and selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
wd = webdriver.Chrome('chromedriver', options=options)

wd.get("https://scholar.google.com/citations?user=kukA0LcAAAAJ&hl=en&oi=ao")
divs = wd.find_elements_by_class_name('gsc_rsb')

for i in divs[0].find_elements_by_tag_name('a'):
  #print(i)
  print(i.get_attribute('text'))

The result is as follows:

Get my own profile
Citations
h-index
i10-index
1051
1163
1365
1762
2001
2707
4192
7293
13372
25177
40711
65915
87193
101992
21846
View all
Aaron Courville
Pascal Vincent
Kyunghyun Cho
Ian Goodfellow
Yann LeCun
Hugo Larochelle
Caglar Gulcehre
Dzmitry Bahdanau
David Warde-Farley
Xavier Glorot
Razvan Pascanu
Leon Bottou
Sherjil Ozair
Mehdi Mirza
James Bergstra
Olivier Delalleau
Anirudh Goyal
Pascal Lamblin
Patrick Haffner
Nicolas Le Roux

But I only need Citations, h-index, and i10-index, as in the pandas DataFrame below:

|  Name       |   Citations(All)   | Citations(since2016) | h-index  | h-index(since2016) |
+-------------+--------------------+----------------------+----------+--------------------+
|Yoshua Bengio|   387118           |   343301             |  181     |        164         |

How can I achieve this with the code above?

1 answer:

Answer 0 (score: 0)

divs = wd.find_elements_by_class_name('gsc_rsb')  # the original line was missing the closing ')'

You can use the SelectorGadget extension to quickly find CSS selectors, combined with BeautifulSoup's select() / select_one() methods.

Note that even when using selenium or requests-html, Google may still throw a CAPTCHA. If custom headers do not help, the first thing you can do is add proxies to your requests.
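
For example, with the requests library, custom headers and a proxy could be wired up like this (a minimal sketch: the proxy address is a placeholder, and make_scholar_session is just an illustrative helper name, not part of any library):

```python
import requests

def make_scholar_session(proxy=None):
    """Build a requests session with a browser-like User-Agent and an optional proxy."""
    session = requests.Session()
    # A browser-like User-Agent helps avoid the simplest bot checks.
    session.headers.update({
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/89.0.4389.90 Safari/537.36"),
    })
    if proxy:
        # Route both HTTP and HTTPS through the same proxy endpoint.
        session.proxies.update({"http": proxy, "https": proxy})
    return session

# The proxy URL below is a placeholder, not a working endpoint.
session = make_scholar_session(proxy="http://user:pass@proxy.example.com:8080")
# html = session.get("https://scholar.google.com/citations?user=kukA0LcAAAAJ").text
```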

Here is my approach (it also works for other profiles):

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome()
browser.get('https://scholar.google.com/citations?user=kukA0LcAAAAJ')
html = browser.page_source

soup = BeautifulSoup(html, 'lxml')

name = soup.select_one('#gsc_prf_in').text

citations_all = soup.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
citations_since_2016 = soup.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text

h_index_all = soup.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
h_index_since_2016 = soup.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text

i_10_index_all = soup.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
i_10_index_since_2016 = soup.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text

data = {
    "Name": [name],
    "Citations": citations_all,
    "Citations Since 2016": citations_since_2016,
    "h-index": h_index_all,
    "h-index Since 2016": h_index_since_2016,
    "i10-index": i_10_index_all,
    "i10-index Since 2016": i_10_index_since_2016,
}
}

df = pd.DataFrame(data)

print(df)

Output:

            Name Citations Citations 2016 h-index h-index 2016 i10-index  \
0  Yoshua Bengio    387118         343301     181          164       625   

  i10-index 2016  
0            55
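
The same six numbers can also be read with Selenium alone, without BeautifulSoup, by selecting the .gsc_rsb_std cells of the stats table directly. A sketch under those assumptions (the By-based calls are Selenium 4 style and are left commented because they need a live browser; stats_from_cells is just an illustrative helper):

```python
# from selenium.webdriver.common.by import By  # Selenium 4 locator API; needs a live browser

def stats_from_cells(cell_texts):
    """Map the six .gsc_rsb_std cell texts (in DOM order: Citations, h-index,
    i10-index, each as All then Since-2016) to labeled values."""
    keys = ["Citations", "Citations Since 2016",
            "h-index", "h-index Since 2016",
            "i10-index", "i10-index Since 2016"]
    return dict(zip(keys, cell_texts))

# With a running driver (wd) as in the question:
# cells = [c.text for c in wd.find_elements(By.CSS_SELECTOR, "#gsc_rsb_st .gsc_rsb_std")]
# print(stats_from_cells(cells))
```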

An alternative solution using requests-html - docs.

Code:

from requests_html import HTMLSession
import pandas as pd

session = HTMLSession()
url = 'https://scholar.google.com/citations?user=kukA0LcAAAAJ'
r = session.get(url)
r.html.render()

name = r.html.find('#gsc_prf_in', first = True).text

citations_all = r.html.find('.gsc_rsb_std', first = True).text
citations_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(1) > td:nth-child(3)', first = True).text

h_index = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(2) > td:nth-child(2)', first = True).text
h_index_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(2) > td:nth-child(3)', first = True).text

i10_index = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(3) > td:nth-child(2)', first = True).text
i10_index_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(3) > td:nth-child(3)', first = True).text

data = {
    "Name": [name],
    "Citations": citations_all,
    "Citations 2016": citations_2016,
    "h-index": h_index,
    "h-index 2016": h_index_2016,
    "i10-index": i10_index,
    "i10-index 2016": i10_index_2016,
}

df = pd.DataFrame(data)

print(df)

Output (I guess PyCharm doesn't like long names in the console, so it replaces them with ..., but the values are there):

            Name Citations Citations 2016  ... h-index 2016 i10-index i10-index 2016
0  Yoshua Bengio    388599         344787  ...          164       627            552
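
The ... columns are only pandas' console truncation, not missing data; widening the display options (or printing with to_string(), which never elides columns) shows everything. A small sketch with hard-coded values standing in for the scraped frame:

```python
import pandas as pd

# Widen the console output so no columns are elided.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)

df = pd.DataFrame({"Name": ["Yoshua Bengio"], "Citations": ["388599"],
                   "i10-index 2016": ["552"]})
# to_string() also never truncates columns, regardless of display options.
print(df.to_string(index=False))
```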

Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It is a paid API with a free trial of 5,000 searches. Check out the playground to test it.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "9PepYk8AAAAJ",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}\n')

# Output:
'''
67595
28238
110
63
966
448

https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=9PepYk8AAAAJ
7
'''
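
To end up with the same pandas frame as the earlier solutions, the cited_by table from the API response can be flattened directly. A sketch using a hard-coded fragment shaped like results['cited_by']['table'] above in place of a live API call (cited_by_to_frame is just an illustrative helper name):

```python
import pandas as pd

def cited_by_to_frame(name, table):
    """Flatten a cited_by table (a list of one-key dicts, each mapping a metric
    to its 'all' and 'since_2016' values) into a one-row DataFrame."""
    row = {"Name": name}
    for entry in table:
        for metric, values in entry.items():  # e.g. "citations", "h_index", "i10_index"
            row[metric] = values["all"]
            row[f"{metric}_since_2016"] = values["since_2016"]
    return pd.DataFrame([row])

# Fragment shaped like results['cited_by']['table'] in the code above.
sample_table = [
    {"citations": {"all": 67595, "since_2016": 28238}},
    {"h_index": {"all": 110, "since_2016": 63}},
    {"i10_index": {"all": 966, "since_2016": 448}},
]
print(cited_by_to_frame("Author Name", sample_table))
```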
Disclaimer, I work for SerpApi.