BeautifulSoup not grabbing 'img src' as expected

Time: 2014-04-02 21:46:27

Tags: python html beautifulsoup

I am trying to use BeautifulSoup to parse the image URLs out of Bing image results.

This part initially behaves as expected:

from bs4 import BeautifulSoup
import requests

def get_soup(url):
    return BeautifulSoup(requests.get(url).text)

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)

However, the following returns an empty list instead of a list of URLs:

import re

bimg = re.compile("mm.bing.net")
img_links = soup.find_all("img", {"src": bimg})
print(img_links)

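When I print soup.prettify(), I can see the URLs I want. It looks like all of the img tags may be inside a script. Is it possible that something in BS4 is acting up and not seeing them?

Here is some of the prettified output that contains the URLs.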

<script type="text/javascript">
  //<![CDATA[
var t = '<div class="iol_fp" id="iol_bg"></div><div id="iol_ph"></div><div id="iol_dp"><button id="iol_cls" title="Close"></button><div id="iol_ip"><div id="iol_imp">
<div id="iol_imw"></div><div class="iol_nav" id="iol_navl"></div><div class="iol_nav" id="iol_navr"></div></div><div id="iol_mdb"><span class="iol_mdi" id="iol_md"><span id="iol_mdis"></span><span id="iol_sep">·</span><a id="iol_mdit"></a></span>
<span id="iol_bspan"><button class="iol_mdi" id="iol_pin" href="#" title="Pin to Pinterest"></button><button class="iol_mdi" id="iol_vl" href="#">Show larger</button><button class="iol_mdi" id="iol_vs" href="#">Show smaller</button>
<button class="iol_mdi" id="iol_ss" href="#">Play All</button><button class="iol_mdi" id="iol_sse" href="#">Pause</button></span></div><div id="iol_fsw"><div id="iol_fscb"></div><div id="iol_fsc"></div></div></div><div id="iol_sp"><div id="iol_rs">
<div id="iol_rst">ALSO CONSIDER</div><span id="iol_rsp"><div><div class="iol_rsc"><a href="/images/search?q=Doggy+GIF+Style+1+2+3&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Doggy GIF Style 1 2 3" h="ID=images,5187.2">
<img src="http://ts4.mm.bing.net/th?q=Doggy+GIF+Style+1+2+3&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq">Doggy<br/><strong>GIF Style 1 2 3</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Puppies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Puppies" h="ID=images,5189.2"><img src="http://ts1.mm.bing.net/th?q=Puppies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Puppies</strong></span></a></div><div class="iol_rsc"><a href="/images/search?q=Funny+Doggies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Funny Doggies" h="ID=images,5191.2">
<img src="http://ts4.mm.bing.net/th?q=Funny+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Funny</strong><br/>Doggies</span></a></div><div class="iol_rsc"><a href="/images/search?q=Doggie+Dentures&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Doggie Dentures" h="ID=images,5193.2">
<img src="http://ts1.mm.bing.net/th?q=Doggie+Dentures&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Doggie Dentures</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Cute+Doggies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Cute Doggies" h="ID=images,5195.2"><img src="http://ts3.mm.bing.net/th?q=Cute+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Cute</strong><br/>Doggies

Any help is much appreciated!
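
The suspicion raised above, that the img tags shown in the prettified output live inside a script, can be checked without touching the network. An HTML parser hands the body of a script element over as raw text, so tags written inside a JavaScript string are never reported as elements. The following is only a minimal sketch using the standard library: the sample markup, the class name ImgAndScriptCollector and the regex SRC_IN_TEXT are made up for illustration and are not taken from the real Bing page.

```python
import re
from html.parser import HTMLParser

# Made-up sample: one real <img> element and one <img> that only exists
# as text inside a JavaScript string, similar in shape to the output above.
SAMPLE = (
    '<div><img src="http://ts1.mm.bing.net/th?q=Puppies"/></div>'
    '<script type="text/javascript">'
    "var t = '<img src=\"http://ts4.mm.bing.net/th?q=Funny+Doggies\"/>';"
    '</script>'
)


class ImgAndScriptCollector(HTMLParser):
    """Record the src of real <img> elements and the raw text of <script> bodies."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.img_srcs.append(dict(attrs).get("src"))
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Everything between <script> and </script> arrives here as plain text.
        if self._in_script:
            self.script_text.append(data)


collector = ImgAndScriptCollector()
collector.feed(SAMPLE)
collector.close()

# URLs that only appear inside script text can still be pulled out with a regex.
SRC_IN_TEXT = re.compile(r'<img\s+src="([^"]+)"')
hidden_srcs = SRC_IN_TEXT.findall("".join(collector.script_text))

print(collector.img_srcs)  # src values of real <img> elements
print(hidden_srcs)         # src values found only inside the script string
```

If the URLs of interest turn up only in the second list, no find_all("img") call will reach them, and the script text has to be searched as a string instead.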

3 Answers:

Answer 0 (score: 1)

@alecxe was on the right track: this is an HTML5 issue. I installed the html5lib library, and the following code solved the problem:

from bs4 import BeautifulSoup
import requests
import html5lib

def get_soup(url):
    # Name the parser explicitly so the result does not depend on what is installed.
    return BeautifulSoup(requests.get(url).text, 'html5lib')

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)

Thanks for the help.
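
Since the fix above comes down to which parser BeautifulSoup uses, one way to see whether the parser is the culprit for a given page is to run the same markup through each installed parser and compare the number of img tags found. This is a rough sketch, not a diagnosis of the Bing page: the sample markup and the helper name count_imgs_per_parser are assumptions made for illustration, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up sample markup, not real Bing output.
SAMPLE = '<div><img src="http://ts1.mm.bing.net/th?q=Puppies"/></div>'


def count_imgs_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <img> tags found}, skipping missing parsers."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The library backing this parser is not installed; leave it out.
            continue
        counts[name] = len(soup.find_all("img"))
    return counts


counts = count_imgs_per_parser(SAMPLE)
print(counts)
```

On well-formed markup like the sample, every available parser should report the same count. A large gap between parsers on the real response text would point to the parser choice rather than to the search pattern.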

Answer 1 (score: 0)

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
from bs4 import BeautifulSoup

url = "http://www.bing.com/images/search?q=%s&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3" % 'doggy'

html_page = urlopen(url)
soup = BeautifulSoup(html_page)

# Collect the src attribute of every <img> tag (None when the attribute is missing).
links = soup.find_all("img")
img_links = [link.get('src') for link in links]

# Keep only absolute http:// URLs; building a new list avoids deleting
# items from img_links while iterating over it.
img_links = [src for src in img_links if src and "http://" in src]

Try this.

The links should end up in the img_links list.
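
The substring test above keeps any value that contains "http://" anywhere in it. A slightly stricter variant parses each value and checks its scheme. The sketch below is an assumption-laden illustration rather than part of the answer: the helper name keep_absolute_http and the sample values are made up, and accepting https as well as http is a choice of this sketch.

```python
from urllib.parse import urlparse


def keep_absolute_http(srcs):
    """Keep only values that parse as absolute http:// or https:// URLs."""
    kept = []
    for src in srcs:
        if not src:
            # Skips None (tag without a src attribute) and empty strings.
            continue
        parts = urlparse(src)
        if parts.scheme in ("http", "https") and parts.netloc:
            kept.append(src)
    return kept


# Made-up values of the kind link.get('src') can return.
sample = [
    "http://ts1.mm.bing.net/th?q=Puppies",
    None,
    "/sa/simg/placeholder.png",
    "data:image/gif;base64,R0lGOD",
    "https://example.com/a.jpg",
]
filtered = keep_absolute_http(sample)
print(filtered)
```

Relative paths and data: URIs are dropped because they have no http or https scheme, and missing src attributes are dropped before parsing.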

Answer 2 (score: 0)

from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    request = requests.get(url).content
    return BeautifulSoup(request)

query = 'doggy'
url = "http://www.bing.com/images/search?q=" + query + "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3"
soup = get_soup(url)
bimg = re.compile('.*mm.bing.net.*')
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print(link)

Tweak your regex slightly. Output:

<img src="http://ts3.mm.bing.net/th?q=Rabbit&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cow&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Tiger&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Elephant&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Fish&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Fox&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Animal&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Chicken+Bird&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Domestic+Sheep&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Giraffe&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Puppy&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Dolphin&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Pet&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Baby+Birds&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Labrador+Retriever&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Chihuahua&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cat&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Lion&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Zebra&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Bulldog&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
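
To see what the regex tweak does and does not change, the patterns can be compared outside BeautifulSoup. The Beautiful Soup documentation states that a compiled pattern passed as a filter is applied with its search() method, in which case the leading and trailing .* are optional. They only matter with match(), which anchors at the start of the string. The sketch below is illustrative only: the sample strings and the third pattern with escaped dots are assumptions added here, not part of the answer.

```python
import re

loose = re.compile("mm.bing.net")        # pattern from the question
padded = re.compile(".*mm.bing.net.*")   # pattern from this answer
strict = re.compile(r"mm\.bing\.net")    # dots escaped so they match literal dots only

url = "http://ts3.mm.bing.net/th?q=Rabbit"
lookalike = "http://ts3.mmXbingXnet.example/th"  # made-up near miss

patterns = {"loose": loose, "padded": padded, "strict": strict}

# search() looks for the pattern anywhere in the string.
search_results = {
    name: (bool(p.search(url)), bool(p.search(lookalike)))
    for name, p in patterns.items()
}

# match() only succeeds if the pattern matches from the first character.
match_results = {name: bool(p.match(url)) for name, p in patterns.items()}

print(search_results)
print(match_results)
```

With search(), all three patterns find the real URL, and only the escaped pattern rejects the lookalike host. With match(), only the padded pattern succeeds, which is the one case where the extra .* makes a difference.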