Getting all links from a single-page website with BeautifulSoup ("Load More" feature)

Asked: 2016-03-07 16:46:42

Tags: python html web-scraping beautifulsoup

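I want to scrape all links from a website that has no pagination: there is a "LOAD MORE" button, but the URL does not change with the amount of data you request.

When I parse the page with BeautifulSoup and ask for all links, it only returns the links present on the plain initial page. I can reach older content manually by clicking the "Load More" button, but is there a way to do this programmatically?

Here is what I mean: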

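from urllib.request import urlopen  # Python 3; the original used urllib2 (Python 2)
from bs4 import BeautifulSoup

page = urlopen('http://www.thedailybeast.com/politics.html')
soup = BeautifulSoup(page, 'html.parser')

# Print the href of every anchor tag in the initially served HTML
for link in soup.find_all('a'):
    print(link.get('href'))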

Unfortunately, there is no URL that handles pagination.

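1 Answer:

Answer 0: (score: 3)

Clicking the "Load More" button sends an XHR request to the http://www.thedailybeast.com/politics.view.<page_number>.json endpoint. You need to simulate that request in your code and parse the JSON response. A working example using requests: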

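import requests

# Reuse one session so the underlying connection is shared across requests
with requests.Session() as session:
    for page in range(1, 10):
        print("Page number #%s" % page)
        # Same endpoint the "Load More" button calls, with the page number filled in
        response = session.get("http://www.thedailybeast.com/politics.view.%s.json" % page)
        data = response.json()

        # Each entry in "stream" is one article
        for article in data["stream"]:
            print(article["title"])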

Prints:

Page number #1
The Two Americas Behind Donald Trump and Bernie Sanders
...
Hillary Clinton’s Star-Studded NYC Bash: Katy Perry, Jamie Foxx, and More Toast the Candidate
Why Do These Republicans Hate Maya Angelou’s Post Office?
Page number #2
No, Joe Biden Is Not a Supreme Court Hypocrite
PC Hysteria Claims Another Professor
WHY BLACK CELEB ENDORSEMENTS MATTER MOST
...
Inside Trump’s Make Believe Presidential Addresses
...
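
The example above stops at a fixed page count of 9. If the number of pages is not known in advance, one possible variation is to keep requesting pages until one comes back missing or empty. The sketch below is only an illustration under several assumptions: the helper names `fetch_page`, `collect_articles`, and `main` are hypothetical, the endpoint is assumed to return a non-200 status or an empty "stream" once the content runs out, and the "url" key read in `main` is a guess that may not exist in the real response (only "stream" and "title" appear in the answer).

```python
import requests

# Endpoint pattern quoted in the answer; %s is replaced by the page number.
URL_TEMPLATE = "http://www.thedailybeast.com/politics.view.%s.json"


def fetch_page(session, page):
    """Return the decoded JSON for one page, or None if it cannot be fetched."""
    try:
        response = session.get(URL_TEMPLATE % page, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    try:
        return response.json()
    except ValueError:  # the body was not valid JSON
        return None


def collect_articles(fetch, max_pages=50):
    """Call fetch(page) for page = 1, 2, ... until a page is missing or empty.

    max_pages is a safety cap so the loop always terminates.
    """
    articles = []
    for page in range(1, max_pages + 1):
        data = fetch(page)
        # "stream" is the key used in the answer; stop when it is absent or empty.
        if not data or not data.get("stream"):
            break
        articles.extend(data["stream"])
    return articles


def main():
    with requests.Session() as session:
        for article in collect_articles(lambda page: fetch_page(session, page)):
            # "url" is an assumed key; inspect the real JSON to find the link field.
            print(article.get("title"), article.get("url"))
```

Calling `main()` runs the loop against the live site. Passing the fetch step in as a function keeps the paging logic separate from the network code, so the stop condition can be checked against canned data without hitting the site.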