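Python: extract all information (src, href, title) from a class

Time: 2017-10-24 02:01:58

Tags: python html web-scraping beautifulsoup

I'm having trouble extracting all the information I need from this HTML. I need to extract the title, href, and src of each item.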

HTML:

    <div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
        <a itemprop="url" href="/slim?p=3090" class="main">
            <img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
        </a>
    </div>
    <div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
        <a itemprop="url" href="/slim?p=3091" class="main">
            <img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
        </a>
    </div>

Code:

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('http://www.cad.com/')
    soup = BeautifulSoup(res.text, "lxml")
    for a in soup.findAll('div', {"id": "home"}):
        for b in a.select(".main"):
            print("http://www.cad.com" + b.get('href'))
            print(b.get('title'))

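I can get the href successfully, but because title and src sit on another line (on the nested img tag rather than on the a tag), I don't know how to extract them. After that I want to save everything to Excel, so perhaps I should finish the extraction first and then handle the export.

Expected output: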

    /slim?p=3090
    apple
    /FileUploads/Post/3090.jpg?w=70&h=70&mode=crop
    /slim?p=3091
    banana
    /FileUploads/Post/3091.jpg?w=70&h=70&mode=crop

1 answer:

Answer 0: (score: 0)

My own solution:

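    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    BASE_URL = 'http://www.cad.com/'

    res = requests.get(BASE_URL)
    soup = BeautifulSoup(res.text, "lxml")
    # Each item is a div.home-hot-thumb inside the div with id="home".
    for container in soup.find_all('div', {"id": "home"}):
        for thumb in container.find_all('div', {"class": "home-hot-thumb"}):
            # title and src are attributes of the nested <img>; href belongs to the <a>.
            title = thumb.img.get('title')
            print(title)
            # urljoin avoids the double slash that plain string concatenation produces.
            href = urljoin(BASE_URL, thumb.a.get('href'))
            print(href)
            src = urljoin(BASE_URL, thumb.img.get('src'))
            # Drop the resize parameters to get the plain image URL.
            print(src.replace('?w=70&h=70&mode=crop', ''))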
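
The question also asks about saving the results to Excel, which the code above does not cover. Below is a minimal sketch of one way to do that, under a few assumptions: it writes CSV (which Excel can open) rather than a native .xlsx file, it parses an inline copy of the question's HTML with the built-in "html.parser" instead of fetching the live page, and the helper names `extract_rows` and `rows_to_csv` are made up for illustration.

```python
import csv
import io
from urllib.parse import urljoin, urlsplit

from bs4 import BeautifulSoup

BASE_URL = "http://www.cad.com/"  # site root used in the code above

# Inline copy of the question's HTML so the sketch runs without a network call.
SAMPLE_HTML = """
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
    <a itemprop="url" href="/slim?p=3090" class="main">
        <img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
    </a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
    <a itemprop="url" href="/slim?p=3091" class="main">
        <img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
    </a>
</div>
"""


def extract_rows(html):
    """Return one (title, href, src) tuple per div.home-hot-thumb (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for thumb in soup.select("div.home-hot-thumb"):
        link = thumb.find("a")
        img = thumb.find("img")
        if link is None or img is None:
            continue  # skip blocks that lack either tag
        href = urljoin(BASE_URL, link.get("href", ""))
        src = urljoin(BASE_URL, img.get("src", ""))
        # Remove the whole query string instead of one hard-coded parameter set.
        src = urlsplit(src)._replace(query="").geturl()
        rows.append((img.get("title"), href, src))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["title", "href", "src"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To produce a file instead of a string, the same `csv.writer` calls can target a file opened with `open(path, "w", newline="", encoding="utf-8")`. If a real .xlsx workbook is required, a third-party library would be needed in place of the `csv` module.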