How to check whether <a> and/or <img> tags are direct children of a div in Beautiful Soup

Asked: 2018-02-09 05:36:33

Tags: python python-3.x web-scraping beautifulsoup

Let's say I have a page with the following markup inside the <body> tag:

<!-- Tag <a> with <img> inside of it -->
<div class="album_item">
    <a href="http://www.foo.com/img/1"><img src="http://thumbnail.foo.com/img/1.jpg" /></a>
    <a href="http://www.foo.com/img/2"><img src="http://thumbnail.foo.com/img/2.jpg" /></a>
    <a href="http://www.foo.com/img/3"><img src="http://thumbnail.foo.com/img/3.jpg" /></a>
    <a href="http://www.foo.com/img/4"><img src="http://thumbnail.foo.com/img/4.jpg" /></a>

</div>

<!-- Only tag <img> -->
<div class="album_item">
    <img src="http://large.foo.com/img/5.jpg" />
    <img src="http://large.foo.com/img/6.jpg" />
</div>

<!-- Combination Of Both Above -->
<div class="album_item">
    <a href="http://www.foo.com/img/7"><img src="http://thumbnail.foo.com/img/7.jpg" /></a>
    <a href="http://www.foo.com/img/8"><img src="http://thumbnail.foo.com/img/8.jpg" /></a>
    <a href="http://www.foo.com/img/9"><img src="http://thumbnail.foo.com/img/9.jpg" /></a>
    <a href="http://www.foo.com/img/10"><img src="http://thumbnail.foo.com/img/10.jpg" /></a>

    <img src="http://large.foo.com/img/11.jpg" />
    <img src="http://large.foo.com/img/12.jpg" />
</div>

I want to scrape it using the code below:

import requests
from bs4 import BeautifulSoup as soup

my_url = 'http://www.foo-url.com'

uClient = requests.get(my_url)
page_html = uClient.text
uClient.close()

page_soup = soup(page_html, "html.parser")

#Identify Each Post Group
containers = page_soup.findAll("div",{"class": "album_item"})

data = []

for container in containers:
    #Store Each Pictures To An Object
    items = container.findAll("a")

    for item in items:
        #Set The Link Location
        link_location = item.attrs['href']
        image_item = item.find("img")

        #Set The Image Location
        img_location = image_item.attrs['src']

        data.append((link_location, img_location))

    #Just In Case There Are Only Images
    imgs = container.findAll("img")

    for img in imgs:
        link_location = "NoLink"
        img_location = img.attrs['src']
        data.append((link_location, img_location))

for link_location, img_location in data:
    print(link_location + " | " + img_location)

The result contains a lot of duplicates, like this:

http://www.foo.com/img/1 | http://thumbnail.foo.com/img/1.jpg
http://www.foo.com/img/2 | http://thumbnail.foo.com/img/2.jpg
http://www.foo.com/img/3 | http://thumbnail.foo.com/img/3.jpg
http://www.foo.com/img/4 | http://thumbnail.foo.com/img/4.jpg

NoLink | http://thumbnail.foo.com/img/1.jpg       #duplicate
NoLink | http://thumbnail.foo.com/img/2.jpg       #duplicate
NoLink | http://thumbnail.foo.com/img/3.jpg       #duplicate
NoLink | http://thumbnail.foo.com/img/4.jpg       #duplicate

NoLink | http://large.foo.com/img/5.jpg
NoLink | http://large.foo.com/img/6.jpg

http://www.foo.com/img/7 | http://thumbnail.foo.com/img/7.jpg
http://www.foo.com/img/8 | http://thumbnail.foo.com/img/8.jpg
http://www.foo.com/img/9 | http://thumbnail.foo.com/img/9.jpg
http://www.foo.com/img/10 | http://thumbnail.foo.com/img/10.jpg

NoLink | http://thumbnail.foo.com/img/7.jpg       #duplicate
NoLink | http://thumbnail.foo.com/img/8.jpg       #duplicate
NoLink | http://thumbnail.foo.com/img/9.jpg       #duplicate
NoLink | http://thumbnail.foo.com/img/10.jpg      #duplicate

NoLink | http://large.foo.com/img/11.jpg
NoLink | http://large.foo.com/img/12.jpg

My idea is to check inside each <div class="album_item">:
if all of the children are <a> tags, run the for item in items: loop;
else if all of the children are <img> tags, run the for img in imgs: loop.
But what if both kinds of tag are present?

I am also not sure how to check for those tags.
On the first <div>, I tried if(container.select("img")), which I expected to be false,
but it is true because it detects the <img> tags nested inside the <a> tags.

So, how should I approach this?

1 answer:

Answer 0 (score: 2)

What you want is tag.find_all(recursive=False)

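From the documentation:

"If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False."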
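To make the difference concrete, here is a minimal sketch on a cut-down fragment modelled on the question's third <div> (the fragment and the variable names are assumptions for illustration, not taken from the real page). It also shows one way to test whether a container has an <img> as a direct child, which is the check the question was attempting with select("img").

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one linked thumbnail plus one bare image
html = """
<div class="album_item">
    <a href="http://www.foo.com/img/7"><img src="http://thumbnail.foo.com/img/7.jpg" /></a>
    <img src="http://large.foo.com/img/11.jpg" />
</div>
"""

container = BeautifulSoup(html, "html.parser").find("div", class_="album_item")

# Default search: every <img> descendant, including the one nested in <a>
all_imgs = container.find_all("img")

# recursive=False: only <img> tags that are direct children of the <div>
direct_imgs = container.find_all("img", recursive=False)

# Truthiness check restricted to direct children
has_direct_img = container.find("img", recursive=False) is not None

print(len(all_imgs), len(direct_imgs), has_direct_img)
```

Here all_imgs holds both images, while direct_imgs holds only the one that is not wrapped in a link.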

In your code, change this line

imgs = container.findAll("img")

to

imgs = container.findAll("img", recursive=False)

Output:

http://www.foo.com/img/1 | http://thumbnail.foo.com/img/1.jpg
http://www.foo.com/img/2 | http://thumbnail.foo.com/img/2.jpg
http://www.foo.com/img/3 | http://thumbnail.foo.com/img/3.jpg
http://www.foo.com/img/4 | http://thumbnail.foo.com/img/4.jpg
NoLink | http://large.foo.com/img/5.jpg
NoLink | http://large.foo.com/img/6.jpg
http://www.foo.com/img/7 | http://thumbnail.foo.com/img/7.jpg
http://www.foo.com/img/8 | http://thumbnail.foo.com/img/8.jpg
http://www.foo.com/img/9 | http://thumbnail.foo.com/img/9.jpg
http://www.foo.com/img/10 | http://thumbnail.foo.com/img/10.jpg
NoLink | http://large.foo.com/img/11.jpg
NoLink | http://large.foo.com/img/12.jpg
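For completeness, here is a self-contained sketch that can be run without a network request. It is a variation on the approach above, not the original code: it walks the direct children of each container in a single pass, so <a> and <img> entries come out in document order. The function name extract_pairs and the shortened SAMPLE_HTML are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (one entry per case)
SAMPLE_HTML = """
<div class="album_item">
    <a href="http://www.foo.com/img/1"><img src="http://thumbnail.foo.com/img/1.jpg" /></a>
</div>
<div class="album_item">
    <img src="http://large.foo.com/img/5.jpg" />
</div>
<div class="album_item">
    <a href="http://www.foo.com/img/7"><img src="http://thumbnail.foo.com/img/7.jpg" /></a>
    <img src="http://large.foo.com/img/11.jpg" />
</div>
"""


def extract_pairs(html):
    """Return (link, image) tuples; hypothetical helper for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    data = []
    for container in soup.find_all("div", class_="album_item"):
        # Only direct <a> and <img> children, so nested thumbnails
        # are not picked up a second time as bare images
        for child in container.find_all(["a", "img"], recursive=False):
            if child.name == "a":
                img = child.find("img")
                if img is not None and img.has_attr("src"):
                    data.append((child.get("href", "NoLink"), img["src"]))
            elif child.has_attr("src"):
                data.append(("NoLink", child["src"]))
    return data


for link_location, img_location in extract_pairs(SAMPLE_HTML):
    print(link_location + " | " + img_location)
```

The guards on img and src are there because a real page may contain links without an image inside them; the sample markup in the question does not need them.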