from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://www.open2study.com/courses').read()
soup = BeautifulSoup(r)
links = soup.find('figure').find_all('img', src=True)
for link in links:
    txt = open('test.txt' , "w")
    link = link["src"].split("src=")[-1]
    download_img = urllib.urlopen('https://www.open2study.com/courses')
    txt.write(download_img.read())
    txt.close()
I need to scrape the images and titles from this website.
Answer 0 (score: 1)
You can read the src attribute directly with BeautifulSoup instead of using split.
Use this to get the divs that contain the title and the image:
for link in soup.find_all("div",attrs={"class" : "courses_adblock_start"}):
Then use this to get the title and the image inside each div:
link.find("h2", attrs={"class": "adblock_course_title"}).get_text()
link.find("img", attrs={"class": "image-style-course-logo-subjects-block"}).get("src")
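You should also avoid opening the page on every iteration of the loop. Open it once and reuse the result inside the loop, like this:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.open2study.com/courses"
# Open and parse the page once, outside the loop
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
for link in soup.find_all("div", attrs={"class": "courses_adblock_start"}):
    try:
        print("Title : " + link.find("h2", attrs={"class": "adblock_course_title"}).get_text())
        print("Image : " + link.find("img", attrs={"class": "image-style-course-logo-subjects-block"}).get("src"))
    except (AttributeError, TypeError):
        # find() returned None, or the img has no src attribute
        print("error")
This is the new output: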
Title : World Music
Image : https://www.open2study.com/sites/default/files/styles/course_logo_subjects_block/public/Course%20Tile_world_music.jpg?itok=CG6pvXHp
Title : Writing for the Web
Image : https://www.open2study.com/sites/default/files/styles/course_logo_subjects_block/public/3_writing_for_web_C_0.jpg?itok=exQApr-1
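If you also want to save the images themselves, each src has to be fetched separately and written in binary mode. Here is a minimal sketch of that step, not a definitive implementation: filename_from_url and download_image are hypothetical helper names, and the "images" folder is an assumption. It uses only the standard library.

```python
import os
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(img_url):
    """Return a local file name for an image URL (hypothetical helper)."""
    # Keep only the path, which drops the "?itok=..." query string
    path = urlsplit(img_url).path
    # Decode percent-escapes such as %20 and take the last path segment
    return os.path.basename(unquote(path))


def download_image(img_url, folder="images"):
    """Fetch img_url and save it under folder; returns the saved path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename_from_url(img_url))
    with urllib.request.urlopen(img_url) as response:
        # Images are binary data, so open the file with "wb", not "w"
        with open(target, "wb") as f:
            f.write(response.read())
    return target
```

Inside the loop you would pass the value returned by .get("src") to download_image, so every course gets its own file instead of all of them overwriting test.txt.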
Answer 1 (score: 1)
Something like this?
# Python 2 code; on Python 3 use urllib.request.urlopen instead
import urllib
from bs4 import BeautifulSoup

titles = []
images = []
r = urllib.urlopen('https://www.open2study.com/courses').read()
soup = BeautifulSoup(r)
# Collect the course titles
for i in soup.find_all('div', {'class': "courses_adblock_rollover"}):
    titles.append(i.h2.text)
# Collect the image URLs
for i in soup.find_all(
        'img', {
            'class': "image-style-course-logo-subjects-block"}):
    images.append(i.get('src'))
# Write each title followed by its image URL
with open('test.txt', "w") as f:
    for i in zip(titles, images):
        f.write(i[0].encode('ascii', 'ignore') +
                '\n'+i[1].encode('ascii', 'ignore') +
                '\n\n')
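The file-writing step above relies on Python 2 string handling. On Python 3, str.encode returns bytes, so concatenating the result with '\n' raises a TypeError. A minimal sketch of the same step for Python 3, assuming titles and images are lists of str (write_courses is a hypothetical name):

```python
def write_courses(path, titles, images):
    """Write each title and its image URL to path, separated by blank lines."""
    # encoding="utf-8" keeps non-ASCII characters instead of dropping them
    with open(path, "w", encoding="utf-8") as f:
        # zip stops at the shorter list, so unmatched entries are skipped
        for title, image in zip(titles, images):
            f.write(title + "\n" + image + "\n\n")
```

For example, write_courses('test.txt', titles, images) would replace the final with block.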