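I want to scrape all the links from a website that has no pagination: there is a "LOAD MORE" button, but the URL does not change with the amount of data you request.
When I parse the page with BeautifulSoup and ask for all the links, I only get the links that appear on the plain front page of the site. I can reach older content manually by clicking the "Load More" button, but is there a way to do this programmatically?
This is what I mean: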
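import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.thedailybeast.com/politics.html')
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))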
Unfortunately, there is no URL responsible for pagination.
Answer 0 (score: 3)
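Clicking the "Load More" button sends an XHR request to the http://www.thedailybeast.com/politics.view.<page_number>.json endpoint. You need to simulate that request in your code and parse the JSON response. A working example using requests: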
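import requests

# Reuse one session so the connection is shared across page requests.
with requests.Session() as session:
    for page in range(1, 10):
        print("Page number #%s" % page)
        # Same endpoint the "Load More" button requests via XHR.
        response = session.get("http://www.thedailybeast.com/politics.view.%s.json" % page)
        data = response.json()
        # Each entry in "stream" is one article.
        for article in data["stream"]:
            print(article["title"])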
Prints:
Page number #1
The Two Americas Behind Donald Trump and Bernie Sanders
...
Hillary Clinton’s Star-Studded NYC Bash: Katy Perry, Jamie Foxx, and More Toast the Candidate
Why Do These Republicans Hate Maya Angelou’s Post Office?
Page number #2
No, Joe Biden Is Not a Supreme Court Hypocrite
PC Hysteria Claims Another Professor
WHY BLACK CELEB ENDORSEMENTS MATTER MOST
...
Inside Trump’s Make Believe Presidential Addresses
...
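The example above stops at a fixed page count. If you would rather keep going until the content runs out, a minimal sketch could look like the following. Note the assumptions: `collect_titles`, `fetch_page`, and `fake_fetch` are hypothetical names made up for illustration, the stopping rule (a missing page or an empty "stream" means there is nothing more to load) is a guess about how the endpoint behaves, and the data is invented so the snippet runs offline.

```python
def collect_titles(fetch_page, max_pages=10):
    """Gather article titles page by page.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict,
    or None when the page is unavailable.
    """
    titles = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        # Assumed stopping rule: no data or an empty "stream" means no more pages.
        if not data or not data.get("stream"):
            break
        titles.extend(article["title"] for article in data["stream"])
    return titles


def fake_fetch(page):
    """Offline stand-in for the JSON endpoint (made-up data, two pages only)."""
    pages = {
        1: {"stream": [{"title": "First"}, {"title": "Second"}]},
        2: {"stream": [{"title": "Third"}]},
    }
    return pages.get(page)


print(collect_titles(fake_fetch))
```

Against the real site you would pass a function that performs the `session.get(...)` call from the example above and returns `response.json()`. Keeping the HTTP call outside the loop logic also makes it easy to test the pagination without hitting the network.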