I have been using the following code to parse the web page at https://www.blogforacure.com/members.php. The code should return the links of all the members listed on that page.
from bs4 import BeautifulSoup
import urllib

# Python 2 API; on Python 3 the equivalent is urllib.request.urlopen
r = urllib.urlopen('https://www.blogforacure.com/members.php').read()
soup = BeautifulSoup(r, 'lxml')

# each member link sits inside an <h3> element
headers = soup.find_all('h3')
print(len(headers))
for header in headers:
    a = header.find('a')
    print(a.attrs['href'])
But I only get the first 10 links from the page above. Even when I print the prettified output, I only see the first 10 links.
Answer 0 (score: 1)
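The results are loaded dynamically through AJAX requests to the https://www.blogforacure.com/site/ajax/scrollergetentries.php endpoint.
Simulate those requests in your code with requests, maintaining a web-scraping session: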
from bs4 import BeautifulSoup
import requests

url = "https://www.blogforacure.com/site/ajax/scrollergetentries.php"

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    session.get("https://www.blogforacure.com/members.php")

    page = 0
    members = []
    while True:
        # get page
        response = session.post(url, data={
            "p": str(page),
            "id": "#scrollbox1"
        })
        html = response.json()['html']

        # parse html
        soup = BeautifulSoup(html, "html.parser")
        page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
        print(page, page_members)

        members.extend(page_members)
        page += 1
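It prints the current page number together with the member names found on that page, and accumulates the names in the members list. I'm not posting what it prints, since the output contains names.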
Note that I've intentionally left the loop endless; please figure out the exit condition yourself. It may be response.json() throwing an error.
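One way the exit condition might look, as a minimal sketch rather than a tested solution for this site: stop when fetching a page raises ValueError (the base class of the error response.json() raises on a non-JSON body), when the 'html' key is missing, or when a page yields no members. The helper name collect_members, the fetch_page callable, the max_pages safety cap and the sample names are all hypothetical; how the endpoint actually behaves past the last page is an assumption.

```python
from typing import Callable, List


def collect_members(fetch_page: Callable[[int], List[str]],
                    max_pages: int = 1000) -> List[str]:
    """Call fetch_page(0), fetch_page(1), ... until the results run out.

    fetch_page is a hypothetical callable that returns the member names
    for one page. It may raise ValueError (as response.json() does on a
    non-JSON body) or KeyError (no 'html' key in the decoded JSON).
    """
    members: List[str] = []
    for page in range(max_pages):  # safety cap instead of `while True`
        try:
            page_members = fetch_page(page)
        except (ValueError, KeyError):
            break  # assumed signal: the endpoint stopped returning usable JSON
        if not page_members:
            break  # assumed signal: an empty page means no more entries
        members.extend(page_members)
    return members


# Offline demo with a fake fetcher, so no request is sent to the site.
fake_pages = {0: ["alice", "bob"], 1: ["carol"]}


def fake_fetch(page: int) -> List[str]:
    if page not in fake_pages:
        raise ValueError("no JSON in response")
    return fake_pages[page]


print(collect_members(fake_fetch))  # ['alice', 'bob', 'carol']
```

With the session from the answer, fetch_page would wrap the session.post(...) call, response.json()['html'], and the soup.select(".memberentry h3 a") step, returning page_members for the requested page.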