Question

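I want to scrape all the links from a website that has no pagination: there is a "LOAD MORE" button, but the URL does not change with the amount of data you request.

When I parse the page with BeautifulSoup and ask for all the links, I only get the links that appear on the plain first page of the site. I can reach older content manually by clicking the "Load more" button, but is there a way to do this programmatically?

This is what I mean: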

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.thedailybeast.com/politics.html')
soup = BeautifulSoup(page, 'html.parser')

for link in soup.find_all('a'):
    print link.get('href')

Unfortunately, there is no URL responsible for pagination.

Answer 1

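Clicking the "Load More" button sends an XHR request to the http://www.thedailybeast.com/politics.view.<page_number>.json endpoint. You need to simulate that request in your code and parse the JSON response. A working example using requests: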

import requests

with requests.Session() as session:
    for page in range(1, 10):
        print("Page number #%s" % page)
        response = session.get("http://www.thedailybeast.com/politics.view.%s.json" % page)
        data = response.json()

        for article in data["stream"]:
            print(article["title"])

Prints:

Page number #1
The Two Americas Behind Donald Trump and Bernie Sanders
...
Hillary Clinton’s Star-Studded NYC Bash: Katy Perry, Jamie Foxx, and More Toast the Candidate
Why Do These Republicans Hate Maya Angelou’s Post Office?
Page number #2
No, Joe Biden Is Not a Supreme Court Hypocrite
PC Hysteria Claims Another Professor
WHY BLACK CELEB ENDORSEMENTS MATTER MOST
...
Inside Trump’s Make Believe Presidential Addresses
...
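
The loop above stops at a fixed page count (range(1, 10)). If you want every page, one option is to keep requesting pages until the endpoint runs out. The following is only a minimal sketch under assumptions the answer does not confirm: that a page past the end returns a non-200 status, invalid JSON, or an empty "stream" list. The names fetch_page, iter_articles, and max_pages are hypothetical helpers introduced for illustration.

```python
# URL template taken from the answer above; %s is the page number.
URL_TEMPLATE = "http://www.thedailybeast.com/politics.view.%s.json"


def fetch_page(session, page):
    """Return the parsed JSON for one page, or None if it cannot be read.

    Assumption: a page past the end gives a non-200 status or invalid JSON.
    """
    response = session.get(URL_TEMPLATE % page, timeout=10)
    if response.status_code != 200:
        return None
    try:
        return response.json()
    except ValueError:  # raised when the body is not valid JSON
        return None


def iter_articles(fetch, max_pages=1000):
    """Yield articles page by page until a page is missing or empty.

    `fetch` is any callable that maps a page number to a dict or None.
    `max_pages` is a safety cap so the loop always terminates.
    """
    for page in range(1, max_pages + 1):
        data = fetch(page)
        # Assumption: an exhausted endpoint has no "stream" or an empty one.
        if not data or not data.get("stream"):
            break
        for article in data["stream"]:
            yield article


def main():
    # Imported here so the helpers above can be exercised without the network.
    import requests

    with requests.Session() as session:
        for article in iter_articles(lambda page: fetch_page(session, page)):
            print(article["title"])
```

Calling main() runs the crawl against the live endpoint. Because iter_articles only needs a callable, you can also pass it a stub (for example a dict's get method) to check the stopping logic offline.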

Get all links with BeautifulSoup from a single-page website ('Load More' feature)
