Question

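I am working on a project that pulls data from Google Scholar. I want to scrape an author's h-index, total citations, and i10-index (the "All" values). For example, for Louisa Gilbert I want to scrape: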

h-index = 36
i10-index = 74
citations = 4383

I have written this:

from bs4 import BeautifulSoup
import urllib.request
url="https://scholar.google.ca/citations?user=OdQKi7wAAAAJ&hl=en"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

But I am not sure how to continue. (I know there are some libraries available, but none of them lets you scrape the h-index and i10-index.)

Answer 1

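You are almost there. You need to find the HTML elements that contain the data you want to extract. In this particular case, the indexes are contained in <td class="gsc_rsb_std"> tags. The page appears to hold two such cells per metric (the all-time value followed by the recent value), which is why the all-time figures sit at positions 0, 2 and 4. Get these tags from the soup object and then use the string attribute to retrieve the text inside each tag (the values come back as strings, not integers):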

indexes = soup.find_all("td", "gsc_rsb_std")
h_index = indexes[2].string
i10_index = indexes[4].string
citations = indexes[0].string
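
The positional lookup above can be wrapped in a small helper that names each value and converts it to an integer. This is only a sketch: the function name parse_scholar_metrics is hypothetical, the cell order (citations, h-index, i10-index, each as an all-time/recent pair) is assumed from the indexes used above, and the "recent" numbers in the sample are made-up placeholders.

```python
# Assumed row order of the stats table, inferred from indexes 0, 2 and 4 above.
METRIC_NAMES = ("citations", "h_index", "i10_index")


def parse_scholar_metrics(cell_texts):
    """Turn the six stats-cell strings into a dict of named integer metrics."""
    expected = 2 * len(METRIC_NAMES)
    if len(cell_texts) != expected:
        # Fail loudly if the page layout differs from the assumed one.
        raise ValueError(f"expected {expected} cells, got {len(cell_texts)}")
    # Strip whitespace and thousands separators before converting.
    values = [int(text.strip().replace(",", "")) for text in cell_texts]
    return {
        name: {"all": values[2 * i], "recent": values[2 * i + 1]}
        for i, name in enumerate(METRIC_NAMES)
    }


# Placeholder input: all-time values from the question, recent values invented.
sample_cells = ["4383", "1200", "36", "20", "74", "45"]
metrics = parse_scholar_metrics(sample_cells)
print(metrics["h_index"]["all"])  # prints 36
```

With a live page, the input could be built as [td.get_text() for td in soup.find_all("td", "gsc_rsb_std")]. The length check means a layout change raises an error instead of silently returning the wrong cell.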

Answer 2

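To scrape all of the information from a Google Scholar Author page, you can use a third-party solution such as SerpApi. It is a paid API with a free trial.

Example Python code (also available in other libraries):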

from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",
  "engine": "google_scholar_author",
  "hl": "en",
  "author_id": "-muoO7gAAAAJ"
}

search = GoogleSearch(params)
results = search.get_dict()

Example JSON output:

"cited_by": {
  "table": [
    {
      "citations": {
        "all": 7326,
        "since_2016": 2613
      }
    },
    {
      "h_index": {
        "all": 47,
        "since_2016": 27
      }
    },
    {
      "i10_index": {
        "all": 103,
        "since_2016": 79
      }
    }
  ]
}
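
As a rough sketch of how the three all-time values might be pulled out of a response shaped like the JSON above, the helper below flattens the "table" list into a single dict. The function name flatten_cited_by is hypothetical, and the code assumes the response always follows the structure shown in this example; it uses the sample data directly so it runs without an API key.

```python
def flatten_cited_by(results, column="all"):
    """Collect one column of the cited_by table into a {metric: value} dict."""
    table = results.get("cited_by", {}).get("table", [])
    metrics = {}
    for row in table:
        # Each row is assumed to hold a single metric name mapped to its columns.
        for name, values in row.items():
            metrics[name] = values.get(column)
    return metrics


# Sample data copied from the example output above.
results = {
    "cited_by": {
        "table": [
            {"citations": {"all": 7326, "since_2016": 2613}},
            {"h_index": {"all": 47, "since_2016": 27}},
            {"i10_index": {"all": 103, "since_2016": 79}},
        ]
    }
}

metrics = flatten_cited_by(results)
print(metrics)
```

Passing column="since_2016" would return the recent values instead. A missing "cited_by" key yields an empty dict rather than an exception.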

You can check the documentation for more details.

Disclaimer: I work at SerpApi.

Author h-index, i10-index and total citations from Google Scholar
