Question

I am trying to extract all of Jean Tirole's articles from Google Scholar (URL: https://scholar.google.com/citations?hl=en&user=ZEDUm5UAAAAJ&view_op=list_works&sortby=title). After downloading the page, I tried the following:

library(rvest)

# Parse the saved page, then walk table body -> rows -> cells and extract their text.
tirole_parent <- read_html("jean_tirole_GoogleScholarCitations.html")
tirole_table <- tirole_parent %>%
                html_nodes(xpath = '//*[@id="gsc_a_b"]') %>%
                html_nodes(xpath = "tr") %>%
                html_nodes(xpath = "td") %>%
                html_text()

However, this only gives me 20 articles. How can I get all of the articles from the HTML?

Answer 1

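If you want to grab the article titles, the correct class name is gsc_a_at.
When you click Show More, the page actually sends an XHR request with the parameters cstart and pagesize.
cstart is the offset of the first result on the page. pagesize is the number of results per page, with a maximum of 100.
There are 660 results in total, so I step cstart from 0 up to 660 in increments of 100.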
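
The offsets this implies can be sketched as follows. This is only an illustration using the standard library: `page_urls` is a hypothetical helper name, and the total of 660 is the figure quoted above, not something the code discovers on its own.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/citations"


def page_urls(user, total, pagesize=100):
    """Build one list_works URL per page of results (hypothetical helper)."""
    urls = []
    # cstart is the offset of the first result on each page.
    for cstart in range(0, total, pagesize):
        query = urlencode({
            "hl": "en",
            "user": user,
            "view_op": "list_works",
            "sortby": "title",
            "cstart": cstart,
            "pagesize": pagesize,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


# 660 results at 100 per page gives offsets 0, 100, ..., 600.
urls = page_urls("ZEDUm5UAAAAJ", total=660)
```

Each URL in `urls` can then be fetched and parsed in turn.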

from bs4 import BeautifulSoup
import requests

# Request pages of 100 results each; cstart is the offset of the first result.
for start in range(0, 700, 100):
    r = requests.get(
        f"https://scholar.google.com/citations?hl=en&user=ZEDUm5UAAAAJ&view_op=list_works&sortby=title&cstart={start}&pagesize=100")
    soup = BeautifulSoup(r.text, features="html.parser")
    # Each article title is an <a> element with class gsc_a_at.
    for item in soup.find_all('a', attrs={'class': 'gsc_a_at'}):
        print(item.text)

You can check the output online via That Link.
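
The loop above hard-codes the total of 660. A possible variation, sketched here under assumptions, keeps requesting pages until one comes back with fewer than pagesize titles. `TitleParser`, `extract_titles`, `all_titles`, and the `fetch_page` callable are hypothetical names introduced for illustration, and the standard-library `html.parser` stands in for BeautifulSoup so the sketch has no third-party dependencies.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <a> element whose class includes gsc_a_at."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "gsc_a_at" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def extract_titles(html):
    """Return the article titles found in one page of HTML."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


def all_titles(fetch_page, pagesize=100):
    """Call fetch_page(cstart, pagesize) until a page is shorter than pagesize."""
    titles = []
    cstart = 0
    while True:
        page = extract_titles(fetch_page(cstart, pagesize))
        titles.extend(page)
        if len(page) < pagesize:
            return titles
        cstart += pagesize
```

Here `fetch_page` is any function that returns the HTML text for a given offset, for example a thin wrapper around `requests.get(...).text` with cstart and pagesize filled into the URL. Passing it in as an argument keeps the stopping logic separate from the network call.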

How to extract all information from an HTML object (including information that is not displayed)
