Scraping project URLs from Kickstarter with Beautiful Soup

Date: 2020-03-29 05:33:31

Tags: python python-3.x web-scraping beautifulsoup

I'm trying to scrape project URLs from a Kickstarter page using Beautiful Soup. I'm using the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

project_name_list = soup.find(class_='grid-row flex flex-wrap')

project_name_list_items = project_name_list.find_all('a')
print(project_name_list_items)

for project_name in project_name_list_items:
    links = project_name.get('href')
    print(links)

But this is the output I get:

[<a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>]
None
None
None
None
None
None

I've tried several approaches, such as:

for link in soup.find_all('a'):
    print(link.get('href'))

But still no results. Also, the page I'm scraping has a "Load more" section at the bottom. How can I get the URLs from that section as well? Thanks for your help.

1 Answer:

Answer 0 (score: 3)

The data is not embedded in the HTML markup itself; it is embedded as JSON in an HTML attribute named data-project. One solution is to use find_all("div") and keep only the elements that have that attribute.

The URL itself is in that JSON, but there is also a ref query parameter stored in a separate HTML attribute named data-ref. The following retrieves all the links from page 1:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

data = [
    (json.loads(i["data-project"]), i["data-ref"])
    for i in soup.find_all("div")
    if i.get("data-project")
]

for i in data:
    print(f'{i[0]["urls"]["web"]["project"]}?ref={i[1]}')

You can then iterate over the pages (what the "Load more" button does) simply by incrementing the page query parameter.
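A minimal sketch of that pagination loop, building on the extraction logic above. Note the function names (extract_links, scrape_pages), the three-page cap, and the assumption that a page past the last one returns no data-project divs are all illustrative choices, not something the original answer specifies:

```python
import json

import requests
from bs4 import BeautifulSoup

# Same discover URL as above, with the page number left as a placeholder.
BASE = ('https://www.kickstarter.com/discover/advanced'
        '?category_id=28&staff_picks=1&sort=newest&seed=2639586&page={}')


def extract_links(html):
    """Pull project URLs (with their ref parameter) out of one page of HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for div in soup.find_all('div'):
        if div.get('data-project'):
            project = json.loads(div['data-project'])
            links.append(f'{project["urls"]["web"]["project"]}?ref={div["data-ref"]}')
    return links


def scrape_pages(max_pages=3):
    """Increment the page query parameter until a page yields no projects."""
    all_links = []
    for page_num in range(1, max_pages + 1):
        links = extract_links(requests.get(BASE.format(page_num)).text)
        if not links:  # assumed stop condition: an empty page means we're done
            break
        all_links.extend(links)
    return all_links
```

Stopping on an empty page keeps the loop from hammering the site indefinitely; a fixed max_pages cap is a reasonable extra safeguard when you don't know how many pages exist.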
