I am trying to create a bot that scrapes all the image links from a site and store them somewhere else so I can download the images after.
from selenium import webdriver
import time
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.artstation.com/artwork?sorting=trending'
page = requests.get(url)
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
soup = bs(driver.page_source, 'html.parser')
gallery = soup.find_all(class_="image-src")
data = gallery[0]
for x in range(len(gallery)):
print("TAG:", sep="\n")
print(gallery[x], sep="\n")
if page.status_code == 200:
print("Request OK")
This returns all the links tags i wanted but I can't find a way to remove the html or copy only the links to a new list. Here is an example of the tag i get:
<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>
So, how do i get only the links within the gallery[] list? What i want to do after is to take this links and edit the /smaller-square/ directory to /large/, which is the one that has the high resolution image.
答案 0 :(得分:2)
页面通过AJAX加载数据,因此通过网络检查器我们可以看到调用的位置。此代码段将获取在页面1
上找到的所有图像链接,并按trending
进行排序:
import requests
import json
url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
page = requests.get(url)
json_data = json.loads(page.text)
for data in json_data['data']:
print(data['cover']['medium_image_url'])
打印:
https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480
https://cdna.artstation.com/p/assets/covers/images/012/279/572/medium/ham-sung-choul-braveking-140823-1-3-s3-mini.jpg?1533959982
https://cdnb.artstation.com/p/assets/covers/images/012/275/963/medium/michael-vicente-orb-gem-thumb.jpg?1533933774
https://cdnb.artstation.com/p/assets/images/images/012/275/635/medium/michael-kutsche-piglet-by-michael-kutsche.jpg?1533932387
https://cdna.artstation.com/p/assets/images/images/012/273/384/medium/ben-zhang-unnamed.jpg?1533923353
https://cdnb.artstation.com/p/assets/covers/images/012/273/083/medium/michael-vicente-orb-guardian-thumb.jpg?1533922229
... and so on.
如果打印变量json_data
,您将看到页面发送的其他信息(例如图标图像url,total_count,有关作者的数据等)
答案 1 :(得分:1)
您可以使用键值访问属性。
例如:
from bs4 import BeautifulSoup
s = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''
soup = BeautifulSoup(s, "html.parser")
print(soup.find("div", class_="image-src")["image-src"])
#or
print(soup.find("div", class_="image-src").attrs['image-src'])
输出:
https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301