Scraping images and titles from https://www.open2study.com/courses using Python and BeautifulSoup

Date: 2015-11-07 14:42:00

Tags: python beautifulsoup

from bs4 import BeautifulSoup 
import urllib 

r = urllib.urlopen('https://www.open2study.com/courses').read() 

soup = BeautifulSoup(r) 
links = soup.find('figure').find_all('img', src=True) 

for link in links: 
    txt = open('test.txt' , "w") 
    link = link["src"].split("src=")[-1] 
    download_img = urllib.urlopen('https://www.open2study.com/courses') 
    txt.write(download_img.read()) 
    txt.close()

I need to scrape the images and titles from this website.

2 answers:

Answer 0 (score: 1)

You can get the src directly with BeautifulSoup instead of using split.

Use this to get the divs that contain the title and the image:

for link in soup.find_all("div",attrs={"class" : "courses_adblock_start"}):

Then use this to get the title and the image inside each div:

    link.find("h2", attrs={"class": "adblock_course_title"}).get_text()
    link.find("img", attrs={"class": "image-style-course-logo-subjects-block"}).get("src")

You should also avoid opening the page on every iteration of the loop. Open it once and reuse the result inside the loop, like this:

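import urllib.request
from bs4 import BeautifulSoup

url = "http://www.open2study.com/courses"
# Fetch and parse the page once, outside the loop
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")

for link in soup.find_all("div", attrs={"class": "courses_adblock_start"}):
    try:
        print("Title : " + link.find("h2", attrs={"class": "adblock_course_title"}).get_text())
        print("Image : " + link.find("img", attrs={"class": "image-style-course-logo-subjects-block"}).get("src"))
    except (AttributeError, TypeError):
        # find() returned None (AttributeError) or src was missing (TypeError on concatenation)
        print("error")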

Here is the new output:

Title : World Music
Image : https://www.open2study.com/sites/default/files/styles/course_logo_subjects_block/public/Course%20Tile_world_music.jpg?itok=CG6pvXHp
Title : Writing for the Web
Image : https://www.open2study.com/sites/default/files/styles/course_logo_subjects_block/public/3_writing_for_web_C_0.jpg?itok=exQApr-1
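
The same per-div lookup can be tried without touching the network. The sketch below is only an illustration: `SAMPLE_HTML` and the function name `extract_courses` are made up here, and the markup merely mirrors the class names used above, so the real page may be structured differently. It returns (title, image URL) pairs and skips blocks where either piece is missing instead of printing "error".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; it only mirrors the class names used in this answer.
SAMPLE_HTML = """
<div class="courses_adblock_start">
  <img class="image-style-course-logo-subjects-block" src="https://example.com/a.jpg">
  <h2 class="adblock_course_title">World Music</h2>
</div>
<div class="courses_adblock_start">
  <h2 class="adblock_course_title">No Image Here</h2>
</div>
"""


def extract_courses(html):
    """Return a list of (title, image_url) tuples, skipping incomplete blocks."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for block in soup.find_all("div", class_="courses_adblock_start"):
        title = block.find("h2", class_="adblock_course_title")
        image = block.find("img", class_="image-style-course-logo-subjects-block")
        # find() returns None when nothing matches, so check before using the result
        if title is None or image is None or not image.get("src"):
            continue
        courses.append((title.get_text(strip=True), image["src"]))
    return courses


print(extract_courses(SAMPLE_HTML))
```

Because each title and image come from the same div, the two values cannot drift out of step when one block lacks an image.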

Answer 1 (score: 1)

Something like this?

import urllib.request
from bs4 import BeautifulSoup

titles = []
images = []

# Python 3: urlopen lives in urllib.request
r = urllib.request.urlopen('https://www.open2study.com/courses').read()
soup = BeautifulSoup(r, 'html.parser')

# Collect the course titles
for i in soup.find_all('div', {'class': "courses_adblock_rollover"}):
    titles.append(i.h2.text)

# Collect the image URLs
for i in soup.find_all(
    'img', {
        'class': "image-style-course-logo-subjects-block"}):
    images.append(i.get('src'))

# Write each title followed by its image URL, dropping non-ASCII characters
with open('test.txt', "w", encoding='ascii', errors='ignore') as f:
    for title, image in zip(titles, images):
        f.write(title + '\n' + image + '\n\n')
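
The code above stores only the image URLs. If the image files themselves are wanted, one possible follow-up is sketched below. The helper names `local_name` and `download_images` and the destination directory are assumptions made for illustration, not part of either answer. The file name is taken from the URL path, so the `?itok=...` query string is dropped.

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urlsplit


def local_name(url):
    """Derive a file name from the URL path, ignoring any query string."""
    path = urlsplit(url).path
    # Take the last path segment first, then decode escapes such as %20
    return unquote(posixpath.basename(path))


def download_images(urls, dest_dir):
    """Download each URL into dest_dir and return the saved file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        urllib.request.urlretrieve(url, target)  # performs the network request
        saved.append(target)
    return saved
```

With the `images` list built above, `download_images(images, 'course_images')` would save one file per URL. Two URLs that end in the same file name would overwrite each other, which this sketch does not guard against.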