I'm trying to use BeautifulSoup to parse image URLs out of Bing image search results.
At first this works as expected:
from bs4 import BeautifulSoup
import requests

def get_soup(url):
    # No parser is named here, so BeautifulSoup picks the best one installed
    return BeautifulSoup(requests.get(url).text)

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)
However, the following returns an empty list instead of a list of URLs:
import re

# Match any img tag whose src contains mm.bing.net
bimg = re.compile("mm.bing.net")
img_links = soup.find_all("img", {"src": bimg})
print(img_links)
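When I print soup.prettify(), I can see the URLs I want. It looks like all of the img tags may be inside a script. Could that be why BS4 does not see them?
Here is some of the prettified output that contains the URLs: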
<script type="text/javascript">
//<![CDATA[
var t = '<div class="iol_fp" id="iol_bg"></div><div id="iol_ph"></div><div id="iol_dp"><button id="iol_cls" title="Close"></button><div id="iol_ip"><div id="iol_imp">
<div id="iol_imw"></div><div class="iol_nav" id="iol_navl"></div><div class="iol_nav" id="iol_navr"></div></div><div id="iol_mdb"><span class="iol_mdi" id="iol_md"><span id="iol_mdis"></span><span id="iol_sep">·</span><a id="iol_mdit"></a></span>
<span id="iol_bspan"><button class="iol_mdi" id="iol_pin" href="#" title="Pin to Pinterest"></button><button class="iol_mdi" id="iol_vl" href="#">Show larger</button><button class="iol_mdi" id="iol_vs" href="#">Show smaller</button>
<button class="iol_mdi" id="iol_ss" href="#">Play All</button><button class="iol_mdi" id="iol_sse" href="#">Pause</button></span></div><div id="iol_fsw"><div id="iol_fscb"></div><div id="iol_fsc"></div></div></div><div id="iol_sp"><div id="iol_rs">
<div id="iol_rst">ALSO CONSIDER</div><span id="iol_rsp"><div><div class="iol_rsc"><a href="/images/search?q=Doggy+GIF+Style+1+2+3&Form=IQFRDR" class="iol_rsi" title="Search for: Doggy GIF Style 1 2 3" h="ID=images,5187.2">
<img src="http://ts4.mm.bing.net/th?q=Doggy+GIF+Style+1+2+3&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq">Doggy<br/><strong>GIF Style 1 2 3</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Puppies&Form=IQFRDR" class="iol_rsi" title="Search for: Puppies" h="ID=images,5189.2"><img src="http://ts1.mm.bing.net/th?q=Puppies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Puppies</strong></span></a></div><div class="iol_rsc"><a href="/images/search?q=Funny+Doggies&Form=IQFRDR" class="iol_rsi" title="Search for: Funny Doggies" h="ID=images,5191.2">
<img src="http://ts4.mm.bing.net/th?q=Funny+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Funny</strong><br/>Doggies</span></a></div><div class="iol_rsc"><a href="/images/search?q=Doggie+Dentures&Form=IQFRDR" class="iol_rsi" title="Search for: Doggie Dentures" h="ID=images,5193.2">
<img src="http://ts1.mm.bing.net/th?q=Doggie+Dentures&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Doggie Dentures</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Cute+Doggies&Form=IQFRDR" class="iol_rsi" title="Search for: Cute Doggies" h="ID=images,5195.2"><img src="http://ts3.mm.bing.net/th?q=Cute+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Cute</strong><br/>Doggies
Any help is greatly appreciated!
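If the img tags really do live inside a JavaScript string, as the prettified output suggests, the parser hands them over as script text rather than as elements, so find_all cannot see them. Here is a minimal sketch of that situation. It assumes a trimmed, hypothetical stand-in for the page and the built-in html.parser; img_urls_from_scripts is an illustrative helper name, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page: the <img> tags exist only as text
# inside a JavaScript string literal.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">
//<![CDATA[
var t = '<div><img src="http://ts4.mm.bing.net/th?q=Puppies"/><img src="http://ts1.mm.bing.net/th?q=Funny+Doggies"/></div>';
//]]>
</script>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Script contents are kept as raw text, so no <img> elements are found.
tags_found = soup.find_all("img", {"src": re.compile(r"mm\.bing\.net")})


def img_urls_from_scripts(soup):
    """Collect mm.bing.net URLs that appear as src="..." inside <script> text."""
    pattern = re.compile(r'src="(http://[^"]*mm\.bing\.net[^"]*)"')
    urls = []
    for script in soup.find_all("script"):
        # .string is None for an empty script tag or one with a src attribute only
        urls.extend(pattern.findall(script.string or ""))
    return urls


script_urls = img_urls_from_scripts(soup)
print(len(tags_found))
print(script_urls)
```

Whether this applies to the live Bing page depends on the parser in use and on the markup Bing serves, so it is better treated as a diagnostic than as a fix.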
Answer 0 (score: 1)
@alecxe was on the right track: this was an HTML5 parsing issue. I installed the html5lib library, and the following code solved the problem:
from bs4 import BeautifulSoup
import requests
import html5lib  # not called directly; importing it confirms the parser is installed

def get_soup(url):
    # Name the parser explicitly instead of relying on the default
    return BeautifulSoup(requests.get(url).text, 'html5lib')

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)
Thanks for the help.
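Parser choice can change the tree BeautifulSoup builds from the same markup. Below is a rough way to see how each installed parser treats a given snippet, sketched with the invalid fragment `<a></p>`. The helper parse_with_each_parser is a hypothetical name, and a parser that is not installed is reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree}, or None if the parser is missing."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Raised when the requested parser library is not installed
            results[name] = None
    return results


results = parse_with_each_parser("<a></p>")
for name, tree in results.items():
    print(name, "->", tree)
```

Running the same comparison on the saved Bing response, and counting the img tags each parser yields, would be one way to confirm that the parser was the cause.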
Answer 1 (score: 0)
import urllib.request  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://www.bing.com/images/search?q=%s&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3" % 'doggy'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page)

# Collect the src attribute of every img tag as a string
links = soup.find_all("img")
img_links = []
for link in links:
    img_links.append(str(link.get('src')))

# Keep only absolute http:// URLs; building a new list avoids
# deleting items from img_links while iterating over it
img_links = [src for src in img_links if "http://" in src]
Try this. The links should end up in the img_links list.
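Filtering a list by deleting items inside an index loop tends to skip elements, because everything after a deleted item shifts one position to the left. The sketch below contrasts that pattern with building a new list. The sample src values and both function names are made up for illustration.

```python
def filter_in_place(items):
    """Delete non-http entries while walking indices (skips shifted items)."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for i in range(len(items)):
        try:
            if "http://" not in items[i]:
                del items[i]
        except IndexError:
            # The list shrank, so the original index range is now too long
            break
    return items


def filter_with_comprehension(items):
    """Build a new list that keeps only entries containing http://."""
    return [s for s in items if "http://" in s]


# Hypothetical src values, including str(None) for img tags without a src
srcs = [
    "None",
    "/rp/a.png",
    "http://ts1.mm.bing.net/th?q=Cow",
    "data:image/gif;base64,AAAA",
    "None",
    "http://ts2.mm.bing.net/th?q=Tiger",
]

print(filter_in_place(srcs))
print(filter_with_comprehension(srcs))
```

A single pass of the in-place version leaves some unwanted entries behind, which is why repeating the loop several times can appear to work; the comprehension needs only one pass.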
Answer 2 (score: 0)
from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    # .content is the raw response body as bytes; BeautifulSoup detects the encoding
    request = requests.get(url).content
    return BeautifulSoup(request)

query = 'doggy'
url = "http://www.bing.com/images/search?q=" + query + "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3"
soup = get_soup(url)

bimg = re.compile('.*mm.bing.net.*')
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print(link)
I tweaked your regular expression slightly. Output:
<img src="http://ts3.mm.bing.net/th?q=Rabbit&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cow&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Tiger&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Elephant&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Fish&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Fox&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Animal&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Chicken+Bird&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Domestic+Sheep&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Giraffe&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Puppy&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Dolphin&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Pet&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Baby+Birds&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Labrador+Retriever&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Chihuahua&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cat&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Lion&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Zebra&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Bulldog&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
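For context on the regular expression: BeautifulSoup applies a compiled pattern with its search() method, so a pattern does not need leading or trailing `.*` to match in the middle of an attribute value. The sketch below compares three patterns on a small, made-up snippet. The example.com URL and the srcs_matching helper are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two Bing thumbnails, one look-alike URL, one relative path
SAMPLE_HTML = """
<img src="http://ts3.mm.bing.net/th?q=Rabbit"/>
<img src="http://ts1.mm.bing.net/th?q=Cow"/>
<img src="http://example.com/mmXbingXnet.png"/>
<img src="/rp/logo.png"/>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def srcs_matching(pattern):
    """Return the src of every img tag whose src matches the pattern."""
    return [tag["src"] for tag in soup.find_all("img", {"src": re.compile(pattern)})]


plain = srcs_matching("mm.bing.net")        # unescaped dots match any character
wrapped = srcs_matching(".*mm.bing.net.*")  # same matches as plain under search()
escaped = srcs_matching(r"mm\.bing\.net")   # literal dots only

print(plain == wrapped)
print(escaped)
```

Since the two unescaped patterns select the same tags here, a difference in results on the live page may come from something other than the regex, for example the parser in use or passing `.content` instead of `.text`.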