I am trying to scrape project URLs from a Kickstarter webpage using Beautiful Soup. I am using the following code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
project_name_list = soup.find(class_='grid-row flex flex-wrap')
project_name_list_items = project_name_list.find_all('a')
print(project_name_list_items)
for project_name in project_name_list_items:
    links = project_name.get('href')
    print(links)
But this is the output I get:
[<a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>]
None
None
None
None
None
None
I tried several approaches, such as:
for link in soup.find_all('a'):
    print(link.get('href'))
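But I still get no results. Also, the page I am scraping has a "load more" section at the end of the page. How can I get the URLs in that section? Thanks for your help.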
Answer 0 (score: 3)
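The data is not embedded in the HTML markup itself; it is stored as JSON in an HTML attribute named data-project. One solution is to call find_all("div") and keep only the elements that have that attribute.
The project URL is present in that JSON, but the link also carries a query parameter named ref, whose value is stored in another HTML attribute named data-ref. The following gets all the links for page 1: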
import requests
from bs4 import BeautifulSoup
import json
url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
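# Pair each project's decoded JSON with the value of its data-ref attribute.
data = [
    (json.loads(i["data-project"]), i["data-ref"])
    for i in soup.find_all("div")
    if i.get("data-project")
]
# Build each link from the project URL in the JSON plus the ref query parameter.
for i in data:
    print(f'{i[0]["urls"]["web"]["project"]}?ref={i[1]}')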
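Then, to cover the results behind the "load more" button, you can iterate over the pages by simply incrementing the page query parameter.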
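As a rough sketch of that page iteration, the helpers below rewrite the page query parameter and keep requesting pages until one yields no links. The names with_page and iter_links, the fetch and extract callables, and the max_pages cap are all hypothetical and not part of the answer above; stopping at the first empty page is also an assumption about how the site behaves.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Drop any existing page parameter, keep the others in their order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def iter_links(url, fetch, extract, max_pages=5):
    """Yield links page by page until a page yields none (assumed stop rule).

    `fetch` takes a URL and returns the page HTML as a string.
    `extract` takes that HTML and returns a list of links.
    Both are supplied by the caller; `max_pages` is a safety cap.
    """
    for page in range(1, max_pages + 1):
        links = extract(fetch(with_page(url, page)))
        if not links:
            break
        yield from links
```

Here fetch could be something like lambda u: requests.get(u).text, and extract could wrap the data-project parsing shown above so that it returns the list of links instead of printing them.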