Parsing Google Scholar results with Python and BeautifulSoup

Time: 2018-05-27 19:42:38

Tags: python beautifulsoup google-scholar

Given a typical keyword search in Google Scholar (see screenshot), I'd like to get a dictionary containing the title and URL of each publication shown on the page, e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}.

[Screenshot: Google Scholar results page for a keyword search]

To retrieve the results page from Google Scholar, I use the following code:

# Python 2: FancyURLopener and quote_plus both live in urllib
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup

# Subclass FancyURLopener to send a browser-like User-Agent
class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page

This code correctly returns the results page as (very ugly) HTML. However, I haven't been able to get any further than this, because I can't figure out how to use BeautifulSoup (which I'm not very familiar with) to parse the results page and retrieve the data.

Note that the problem is parsing and extracting the data from the results page, not Google Scholar itself, since the code above retrieves the results page correctly.

Can anyone offer some hints? Thanks in advance!

2 answers:

Answer 0 (score: 5)

Inspecting the page content shows that search results are contained in h3 tags with the attribute class="gs_rt". You can use BeautifulSoup to extract just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dict, and store them in a list of dicts:

import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
# Each result title is an <h3 class="gs_rt"> wrapping an <a> with the link
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

Output:

[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
  'url': 'https://www.nature.com/articles/338427a0'},
 {'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
  'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
 ...]

Note: I used requests instead of urllib, since my urllib wouldn't load FancyURLopener. But however you obtain the page content, the BeautifulSoup syntax should be the same.
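
For reference, here is a minimal sketch of the same fetch using the Python 3 standard library, where these helpers moved into urllib.parse and urllib.request (the User-Agent value here is just an example):

from urllib.parse import quote_plus
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

query = "Vicia faba"
url = ('https://scholar.google.com/scholar?q=' + quote_plus(query)
       + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search')

# Send a browser-like User-Agent, as FancyURLopener's `version` did above
request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(request).read()
page = BeautifulSoup(content, 'lxml')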

Answer 1 (score: 1)

At the time of writing this answer, the answer from andrew_reece no longer works: even though the h3 tags with the correct class are present in the page source, it will still throw an error, e.g. return a CAPTCHA, because Google has detected that your script is automated. Print the response to see the message.

This is what I got after sending too many requests:

The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use, 
or sending requests very quickly.
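
To check whether you were blocked, print the response and look for a CAPTCHA marker. A rough sketch (the substring check is a heuristic of mine, not an official indicator):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # example browser-like User-Agent
response = requests.get('https://scholar.google.com/scholar?hl=en&q=vicia+faba', headers=headers)

# Heuristic: a blocked response contains a CAPTCHA page instead of results
if 'captcha' in response.text.lower():
    print('Blocked: Google is asking you to solve a CAPTCHA')
else:
    print('Got a normal results page')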

The first thing you can do is add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    'http': os.getenv('HTTP_PROXY')  # Or just type your proxy here without os.getenv()
}

The request code then looks like this:

headers = {'User-Agent': 'Mozilla/5.0'}  # browser-like User-Agent
html = requests.get('google scholar link', headers=headers, proxies=proxies).text

Alternatively, you can use requests-HTML, selenium, or pyppeteer without proxies: just render the page to make it work.

Code:

# If you get an empty array, it means you got a CAPTCHA.

from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# Container where data we need is located
# Container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # print(title)

    # absolute_links is a set of URLs; take the first one
    # (drop next(iter(...)) to see the raw set)
    url = next(iter(result.absolute_links))
    # print(url)

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent=2, ensure_ascii=False))

Partial output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]

Basically, you can do the same thing with the Google Scholar API from SerpApi. The difference is that you don't have to render the page or use browser automation such as selenium to get data from Google Scholar. You get instant JSON output, which is faster than selenium or requests-html, with no need to figure out how to bypass Google's blocking.

It's a paid API with a trial of 5,000 searches. A completely free trial version is currently under development.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "vicia faba",
  "hl": "en"
}

search = GoogleSearch(params)  # data extraction happens on SerpApi's backend
results = search.get_dict()    # JSON output converted to a Python dict

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })
    
print(json.dumps(results_data, indent=2, ensure_ascii=False))

Partial output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  },
]

Disclaimer: I work for SerpApi.