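BeautifulSoup, Google Scholar, author names, affiliations and citations

Time: 2015-03-12 12:45:01

Tags: python beautifulsoup google-scholar

I want to get all the author names from Google Scholar. My base URL is http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security so basically I am looking for authors who write about security.

I wrote a Python script using BeautifulSoup, but (I don't know why) it prints an empty list because it doesn't find any of the given elements, even though I can see <div class="gsc_1usr_text"> elements when I view the page source.

Here is my code: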

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
mydivs = soup.findAll("div", { "class" : "gsc_1usr_text" })
print mydivs

The output is [], and print "LEN = " + str(len(mydivs)) shows 0.

I am using Python 2.7.3 on Linux Mint 13.

2 answers:

Answer 0 (score: 1)

Your code works for me.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
mydivs = soup.findAll("div", { "class" : "gsc_1usr_text" })
print mydivs

Output:

[<div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=n-Oret4AAAAJ&amp;hl=pl&amp;oe=Latin2">Adrian Perrig</a></h3><div class="gsc_1usr_aff">Professor of Computer Science at ETH Zürich, Adjunct Professor of ECE and EPP at CMU</div><div class="gsc_1usr_eml">Zweryfikowany adres z inf.ethz.ch</div><div class="gsc_1usr_emlb">@inf.ethz.ch</div><div class="gsc_1usr_cby">Cytowane przez 40938</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:networking">Networking</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:operating_systems">Operating Systems</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:computer_security">Computer Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:internet_security">Internet Security</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=HvwPRJ0AAAAJ&amp;hl=pl&amp;oe=Latin2">Vern Paxson</a></h3><div class="gsc_1usr_aff">Professor, EECS, University of California, Berkeley</div><div class="gsc_1usr_eml">Zweryfikowany adres z berkeley.edu</div><div class="gsc_1usr_emlb">@berkeley.edu</div><div class="gsc_1usr_cby">Cytowane przez 39914</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:networking">Networking</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:measurement">Measurement</a> </div></div>, <div 
class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=2pW1g5IAAAAJ&amp;hl=pl&amp;oe=Latin2">Mihir Bellare</a></h3><div class="gsc_1usr_aff">Professor, Department of Computer Science and Engineering, UCSD</div><div class="gsc_1usr_eml">Zweryfikowany adres z eng.ucsd.edu</div><div class="gsc_1usr_emlb">@eng.ucsd.edu</div><div class="gsc_1usr_cby">Cytowane przez 35459</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:cryptography">Cryptography</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:complexity_theory">Complexity Theory</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=FCsdj0YAAAAJ&amp;hl=pl&amp;oe=Latin2">Wenyuan Xu</a></h3><div class="gsc_1usr_aff">Assistant Profess of Department of Computer Science and Engineering, University of South  …</div><div class="gsc_1usr_cby">Cytowane przez 32521</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:wireless_networks">Wireless Networks</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:jamming_defenses">jamming defenses</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:dependable_systems">dependable systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=vWTI60AAAAAJ&amp;hl=pl&amp;oe=Latin2">Martin Abadi</a></h3><div class="gsc_1usr_aff">Principal Scientist, Google, and Professor Emeritus, UC Santa Cruz</div><div 
class="gsc_1usr_eml">Zweryfikowany adres z cs.ucsc.edu</div><div class="gsc_1usr_emlb">@cs.ucsc.edu</div><div class="gsc_1usr_cby">Cytowane przez 29938</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:programming_languages_and_systems">programming languages and systems</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:specification_and_verification">specification and verification</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=lOZ1vHIAAAAJ&amp;hl=pl&amp;oe=Latin2">Sushil Jajodia</a></h3><div class="gsc_1usr_aff">University Professor, BDM International Professor, and Director, Center for Secure  …</div><div class="gsc_1usr_eml">Zweryfikowany adres z gmu.edu</div><div class="gsc_1usr_emlb">@gmu.edu</div><div class="gsc_1usr_cby">Cytowane przez 29705</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:privacy">privacy</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:database">database</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:databases">databases</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:distributed_systems">distributed systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=Z_enRVYAAAAJ&amp;hl=pl&amp;oe=Latin2">Xiaolan Zhang</a></h3><div class="gsc_1usr_aff">IBM</div><div class="gsc_1usr_eml">Zweryfikowany 
adres z us.ibm.com</div><div class="gsc_1usr_emlb">@us.ibm.com</div><div class="gsc_1usr_cby">Cytowane przez 27321</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:virtualization">Virtualization</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:systems">Systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=W7YBLlEAAAAJ&amp;hl=pl&amp;oe=Latin2">Jean-Pierre Hubaux</a></h3><div class="gsc_1usr_aff">Professor, EPFL</div><div class="gsc_1usr_eml">Zweryfikowany adres z epfl.ch</div><div class="gsc_1usr_emlb">@epfl.ch</div><div class="gsc_1usr_cby">Cytowane przez 24738</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:privacy">Privacy</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:networking">Networking</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=WgyDcoUAAAAJ&amp;hl=pl&amp;oe=Latin2">Ross Anderson</a></h3><div class="gsc_1usr_aff">University of Cambridge</div><div class="gsc_1usr_eml">Zweryfikowany adres z cl.cam.ac.uk</div><div class="gsc_1usr_emlb">@cl.cam.ac.uk</div><div class="gsc_1usr_cby">Cytowane przez 24445</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:cryptology">cryptology</a> <a class="gsc_co_int" 
href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:dependability">dependability</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:technology_policy">technology policy</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=lsKlsJ8AAAAJ&amp;hl=pl&amp;oe=Latin2">Heejo Lee</a></h3><div class="gsc_1usr_aff">Professor of Computer Science, Korea University</div><div class="gsc_1usr_eml">Zweryfikowany adres z korea.ac.kr</div><div class="gsc_1usr_emlb">@korea.ac.kr</div><div class="gsc_1usr_cby">Cytowane przez 23596</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:network">network</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=pl&amp;oe=Latin2&amp;mauthors=label:security">security</a> </div></div>]
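Since the question title also asks for affiliations and citations, here is a minimal sketch of how the divs in the output above could be turned into records. It assumes Python 3 and the class names visible in that output (gsc_1usr_name, gsc_1usr_aff, gsc_1usr_cby); Google may change them at any time. The helper names parse_authors and _text are hypothetical, and SAMPLE_HTML is abbreviated from the first two entries above.

```python
import re

from bs4 import BeautifulSoup

# Two entries abbreviated from the output above (e-mail and interest divs dropped).
SAMPLE_HTML = """
<div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=n-Oret4AAAAJ&amp;hl=pl&amp;oe=Latin2">Adrian Perrig</a></h3><div class="gsc_1usr_aff">Professor of Computer Science at ETH Zürich, Adjunct Professor of ECE and EPP at CMU</div><div class="gsc_1usr_cby">Cytowane przez 40938</div></div>
<div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=HvwPRJ0AAAAJ&amp;hl=pl&amp;oe=Latin2">Vern Paxson</a></h3><div class="gsc_1usr_aff">Professor, EECS, University of California, Berkeley</div><div class="gsc_1usr_cby">Cytowane przez 39914</div></div>
"""


def _text(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node else None


def parse_authors(html):
    """Extract name, affiliation and citation count from each result div."""
    soup = BeautifulSoup(html, "html.parser")
    authors = []
    for div in soup.find_all("div", class_="gsc_1usr_text"):
        cited_by = _text(div, "div", "gsc_1usr_cby")
        # Assumption: the count is the first run of digits in the "cited by" text.
        match = re.search(r"\d+", cited_by) if cited_by else None
        authors.append({
            "name": _text(div, "h3", "gsc_1usr_name"),
            "affiliation": _text(div, "div", "gsc_1usr_aff"),
            "citations": int(match.group()) if match else None,
        })
    return authors


for author in parse_authors(SAMPLE_HTML):
    print(author)
```

Fields that are missing from an entry come back as None instead of raising, which matters because not every profile in the output has every div.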

Answer 1 (score: 0)

You may be sending too many requests, or Google has detected that your script is automated.
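Before changing the request, it can help to check which of these cases you are actually hitting. The sketch below is only a heuristic: the function name diagnose is hypothetical, and the signals it looks for (HTTP status 429, the words "captcha" or "unusual traffic" in the body) are assumptions about what a block page tends to contain, not a documented Google contract.

```python
def diagnose(status_code, html, expected_class="gsc_1usr_text"):
    """Guess why a result page produced no matches.

    Returns one of: "rate_limited", "captcha", "markup_missing", "ok".
    All checks are heuristics and should be adjusted to what you observe.
    """
    body = html.lower()
    if status_code == 429:
        # Assumption: too many requests are answered with HTTP 429.
        return "rate_limited"
    if "captcha" in body or "unusual traffic" in body:
        # Assumption: a block page mentions a CAPTCHA or unusual traffic.
        return "captcha"
    if expected_class.lower() not in body:
        # The page loaded, but the class the script searches for is not there.
        return "markup_missing"
    return "ok"
```

Calling diagnose with the status code and body of the response, and printing the result next to the empty list, makes it easier to tell a block page from a simple selector mismatch.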

The first thing you can try is adding a proxy to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
  'http': os.getenv('HTTP_PROXY')  # Or just type your proxy here without os.getenv()
}

Alternatively, you can make it work without a proxy by using requests-html or selenium to render the whole HTML page, but you can still get a CAPTCHA.
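If you go the proxy route, keep in mind that requests picks a proxy by URL scheme, so an https:// URL needs an 'https' entry as well. The following is a minimal sketch under that assumption, for Python 3 with the requests library; build_session and fetch are hypothetical helper names, and reading the proxy from HTTP_PROXY simply mirrors the snippet in this answer.

```python
import os

import requests

URL = "https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"


def build_session(proxy_url=None):
    """Create a requests session that routes traffic through proxy_url, if given."""
    session = requests.Session()
    if proxy_url:
        # Cover both schemes so that https:// URLs also use the proxy.
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def fetch(session, url=URL, timeout=10):
    """Download the page and return its HTML, raising on HTTP error codes."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# No request is sent here; call fetch(session) when you want to download the page.
session = build_session(os.getenv("HTTP_PROXY"))
```

A proxy only changes where the request comes from, so a CAPTCHA is still possible.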

Code that worked with requests-html (I tested it locally):

# If you get an empty array, you got a CAPTCHA from Google.
# Print the response to see what causes it.
# Note: the code below doesn't do pagination. https://requests-html.kennethreitz.org/#pagination

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security'
response = session.get(url)
# https://requests-html.kennethreitz.org/#requests_html.HTML.render
response.html.render(sleep=1)

for author_name in response.html.find('.gs_ai_name'):
    name = author_name.text
    print(name)

Output:

Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia

Alternatively, you can use the Google Scholar Profiles API from SerpApi. It is a paid API with a trial of 5,000 searches. A completely free trial is currently in development.

The main difference is that you don't have to think about solving CAPTCHAs, or put up with a slow scraping process caused by rendering pages or by stressing your PC with multiple instances, for example when using selenium.

Code to integrate:

from serpapi import GoogleSearch

params = {
  "engine": "google_scholar_profiles",
  "hl": "en",
  "mauthors": "label:security",
  "api_key": "YOUR_API_KEY"
}

search = GoogleSearch(params)
results = search.get_dict()

for author_name in results['profiles']:
    name = author_name['name']
    print(name)

Output:

Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia

Disclaimer: I work for SerpApi.