Getting links with BeautifulSoup when there is no href or <a>

Time: 2018-08-11 04:33:20

Tags: python python-3.x selenium web-scraping beautifulsoup

I am trying to create a bot that scrapes all the image links from a site and stores them somewhere else so I can download the images afterwards.

from selenium import webdriver
import time
from bs4 import BeautifulSoup as bs  
import requests

url = 'https://www.artstation.com/artwork?sorting=trending'
page = requests.get(url)
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
soup = bs(driver.page_source, 'html.parser')
gallery =  soup.find_all(class_="image-src")
data = gallery[0]
for x in range(len(gallery)):
    print("TAG:", sep="\n")
    print(gallery[x], sep="\n")

if page.status_code == 200:  
    print("Request OK")

This returns all the tags I wanted, but I can't find a way to strip the HTML or copy only the links to a new list. Here is an example of a tag I get:

<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>

So, how do I get only the links from the gallery list? What I want to do afterwards is take these links and change the /smaller_square/ directory to /large/, which is the one that has the high-resolution image.

2 answers:

Answer 0 (score: 2)

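The page loads its data via AJAX, so the browser's network inspector shows where the call goes. This snippet fetches all the image links found on page 1, sorted by trending: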

import requests
import json

url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
page = requests.get(url)
json_data = json.loads(page.text)

for data in json_data['data']:
    print(data['cover']['medium_image_url'])

Prints:

https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480
https://cdna.artstation.com/p/assets/covers/images/012/279/572/medium/ham-sung-choul-braveking-140823-1-3-s3-mini.jpg?1533959982
https://cdnb.artstation.com/p/assets/covers/images/012/275/963/medium/michael-vicente-orb-gem-thumb.jpg?1533933774
https://cdnb.artstation.com/p/assets/images/images/012/275/635/medium/michael-kutsche-piglet-by-michael-kutsche.jpg?1533932387
https://cdna.artstation.com/p/assets/images/images/012/273/384/medium/ben-zhang-unnamed.jpg?1533923353
https://cdnb.artstation.com/p/assets/covers/images/012/273/083/medium/michael-vicente-orb-guardian-thumb.jpg?1533922229

... and so on.

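If you print the variable json_data, you will see the other information the page sends (for example icon image URLs, total_count, data about the author, and so on).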
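Since the question asks for the links in a list rather than printed, here is a minimal sketch of collecting them from the parsed response. The helper name collect_cover_urls and the sample_payload dictionary are made up for illustration; only the data / cover / medium_image_url keys come from the snippet above, and a real response may be shaped differently.

```python
def collect_cover_urls(payload, key="medium_image_url"):
    """Return the cover URL of every project in a parsed projects.json payload."""
    urls = []
    for project in payload.get("data", []):
        # Guard against projects with a missing or null cover (an assumption;
        # the real API may always provide one).
        cover = project.get("cover") or {}
        url = cover.get(key)
        if url:
            urls.append(url)
    return urls


# Hand-written stand-in for json_data; a real response carries many more fields.
sample_payload = {
    "data": [
        {"cover": {"medium_image_url": "https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480"}},
        {"cover": None},
    ]
}

cover_urls = collect_cover_urls(sample_payload)
print(cover_urls)
```

With the snippet above, collect_cover_urls(json_data) would be called in place of the print loop.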

Answer 1 (score: 1)

You can access a tag's attributes by key, the same way you would with a dictionary.

For example:

from bs4 import BeautifulSoup
s = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''
soup = BeautifulSoup(s, "html.parser")
print(soup.find("div", class_="image-src")["image-src"])
# or, equivalently, through the .attrs dictionary
print(soup.find("div", class_="image-src").attrs['image-src'])

Output:

https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301
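To apply this to every tag in the gallery list and swap the directory as the question describes, something along these lines might work. It is only a sketch: extract_large_urls is a hypothetical helper name, and it takes the asker's word that replacing /smaller_square/ with /large/ yields the high-resolution image, which is not verified here.

```python
from bs4 import BeautifulSoup


def extract_large_urls(html):
    """Collect every image-src attribute and point it at the /large/ directory."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(class_="image-src")
    # tag.get() returns None when the attribute is missing; those are skipped below.
    urls = [tag.get("image-src") for tag in tags]
    # Assumption from the question: /large/ is the high-resolution variant.
    return [u.replace("/smaller_square/", "/large/") for u in urls if u]


sample = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''

large_urls = extract_large_urls(sample)
print(large_urls)
```

In the question's script, passing driver.page_source instead of sample would return the list for the whole rendered page.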