Scraping the results behind "Show more"

Date: 2015-01-25 09:14:16

Tags: web-scraping beautifulsoup

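I am trying to use BeautifulSoup to scrape all objects with the same tag from a specific website (Google Scholar), but it does not scrape the objects behind "Show more" at the end of the page. How can I fix this?

Here is a sample of my code: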

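# -*- coding: cp1253 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Only the first page of results is fetched; entries behind "Show more" are missing
webpage = urlopen('http://scholar.google.gr/citations?user=FwuKA4UAAAAJ&hl=el')
soup = BeautifulSoup(webpage, 'html.parser')
# Each article title is an <a> element with class "gsc_a_at"
for t in soup.find_all('a', {"class": "gsc_a_at"}):
    print(t.text)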

2 answers:

Answer 0: (score: 1)

You have to pass pagination parameters in the request URL.

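cstart - defines the result offset. It skips the given number of results and is used for pagination (e.g., 0 (default) is the first page of results, 20 is the second page, 40 is the third page, etc.).

pagesize - defines the number of results to return (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). The maximum number of results to return is 100.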
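As a rough sketch of how these two parameters could be used with the code from the question: the helper names (page_url, fetch_titles, all_titles) are made up for illustration, the User-Agent header is an assumption, and Google Scholar may still block or CAPTCHA automated requests, so treat this as a starting point rather than a guaranteed solution.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://scholar.google.gr/citations"  # host taken from the question


def page_url(user, cstart=0, pagesize=100, hl="el"):
    """Build the profile URL for one page of results."""
    query = urlencode({"user": user, "hl": hl, "cstart": cstart, "pagesize": pagesize})
    return f"{BASE_URL}?{query}"


def fetch_titles(url):
    """Download one page and return the article titles found on it."""
    from bs4 import BeautifulSoup  # third-party; imported here so page_url works without it

    # Assumption: a browser-like User-Agent header; the site may still refuse the request.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    return [a.get_text() for a in soup.find_all("a", {"class": "gsc_a_at"})]


def all_titles(user, pagesize=100):
    """Request page after page until a page comes back shorter than pagesize."""
    titles, cstart = [], 0
    while True:
        page = fetch_titles(page_url(user, cstart, pagesize))
        titles.extend(page)
        if len(page) < pagesize:
            return titles
        cstart += pagesize
```

Calling all_titles("FwuKA4UAAAAJ") would then return the titles from every page, provided the requests are not blocked. Using pagesize=100 (the maximum mentioned above) keeps the number of requests small.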

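You can also use a third-party solution like SerpApi to do this for you. It is a paid API with a free trial.

Example Python code for retrieving the second page of results (also available in other libraries):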

from serpapi import GoogleSearch

params = {
  "engine": "google_scholar_author",
  "hl": "en",
  "author_id": "FwuKA4UAAAAJ",
  "start": "20",
  "api_key": "secret_api_key"
}

search = GoogleSearch(params)
results = search.get_dict()

Example JSON output:

"articles": [
  {
    "title": "MuseumScrabble: Design of a mobile game for children’s interaction with a digitally augmented cultural space",
    "link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:RHpTSmoSYBkC",
    "citation_id": "FwuKA4UAAAAJ:RHpTSmoSYBkC",
    "authors": "C Sintoris, A Stoica, I Papadimitriou, N Yiannoutsou, V Komis, N Avouris",
    "publication": "Social and organizational impacts of emerging mobile devices: Evaluating use …, 2012",
    "cited_by": {
      "value": 69,
      "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=6286720977869955347",
      "serpapi_link": "https://serpapi.com/search.json?cites=6286720977869955347&engine=google_scholar&hl=en",
      "cites_id": "6286720977869955347"
    },
    "year": "2012"
  },
  {
    "title": "The effective combination of hybrid usability methods in evaluating educational applications of ICT: Issues and challenges",
    "link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:hqOjcs7Dif8C",
    "citation_id": "FwuKA4UAAAAJ:hqOjcs7Dif8C",
    "authors": "N Tselios, N Avouris, V Komis",
    "publication": "Education and Information Technologies 13 (1), 55-76, 2008",
    "cited_by": {
      "value": 68,
      "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1046912849634390721",
      "serpapi_link": "https://serpapi.com/search.json?cites=1046912849634390721&engine=google_scholar&hl=en",
      "cites_id": "1046912849634390721"
    },
    "year": "2008"
  },
  ...
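For reading that output, something along these lines might work. It is only a sketch based on the fields visible in the sample above; summarize_articles is a hypothetical helper, and it assumes that "cited_by" or its "value" can be missing for some articles.

```python
def summarize_articles(results):
    """Return (title, year, citation count) tuples from a results dict."""
    rows = []
    for article in results.get("articles", []):
        # Assumption: "cited_by" may be absent or empty for some entries.
        cited_by = article.get("cited_by") or {}
        rows.append((article.get("title"), article.get("year"), cited_by.get("value")))
    return rows


# Trimmed-down stand-in for the dict returned by search.get_dict()
sample = {
    "articles": [
        {
            "title": "MuseumScrabble: Design of a mobile game for children's "
                     "interaction with a digitally augmented cultural space",
            "cited_by": {"value": 69},
            "year": "2012",
        }
    ]
}

for title, year, citations in summarize_articles(sample):
    print(year, citations, title)
```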

See the documentation for more details.

Disclaimer: I work at SerpApi.

Answer 1: (score: 0)

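In Chrome, open F12 -> Network, check "Preserve log" and disable the cache. Now press the "Show more" button.

Inspect the GET/POST request that is sent. You will know what to do next.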
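Once the request is visible in the Network tab, it can usually be replayed from Python. The sketch below only shows the general shape: build_replay_request is a hypothetical helper, and the URL and form fields have to be copied from whatever the browser actually sent, since this answer does not name them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_replay_request(url, form=None):
    """Build a request mirroring the one seen in the Network tab.

    With form data the request is sent as POST, otherwise as GET.
    """
    data = urlencode(form).encode("ascii") if form else None
    # Assumption: a browser-like User-Agent header, copied from the browser if needed.
    return Request(url, data=data, headers={"User-Agent": "Mozilla/5.0"})
```

Passing the resulting object to urllib.request.urlopen sends the request, and the response body can then be parsed with BeautifulSoup as in the question.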