I want to download images from the page http://wordpandit.com/learning-bin/visual-vocabulary/page/2/. I fetch it with urllib and parse it with BeautifulSoup. The page contains many URLs, but I only want the ones that end in .jpg and also carry the rel="prettyPhoto[gallery]" attribute. How can I do this with BeautifulSoup? An example link: http://wordpandit.com/wp-content/uploads/2013/02/Obliterate.jpg
#http://wordpandit.com/learning-bin/visual-vocabulary/page/2/
import urllib
import BeautifulSoup
import lxml

baseurl='http://wordpandit.com/learning-bin/visual-vocabulary/page/'
count=2
for count in range(1,2):
    url=baseurl+count+'/'
    soup1=BeautifulSoup.BeautifulSoup(urllib2.urlopen(url))#read will not be needed
    #find all links to imgs
    atag=soup.findAll(rel="prettyPhoto[gallery]")
    for tag in atag:
        soup2=BeautifulSoup.BeautifulSoup(tag)
        imgurl=soup2.find(href).value
        urllib2.urlopen(imgurl)
Answer 0 (score: 0)
Your code contains a lot of unnecessary pieces. Perhaps you intend to use them later, but operations like assigning `count = 2` and then immediately reusing `count` as the counter of the `for ... in range(...)` loop are pointless. The following code does what you want:
import urllib2
from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'
for count in range(1, 2):
    url = baseurl + str(count) + "/"
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page)
    # keep only <a> tags that have both the rel attribute and an href
    atag = soup.findAll(rel="prettyPhoto[gallery]", href=True)
    for tag in atag:
        if tag['href'].endswith(".jpg"):
            imgurl = tag['href']
            img = urllib2.urlopen("http://wordpandit.com" + imgurl)
            with open(imgurl.split("/")[-1], "wb") as local_file:
                local_file.write(img.read())
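As an aside, the same filtering (rel attribute plus a `.jpg` suffix) can be expressed in one CSS selector with `soup.select`, which avoids the explicit `endswith` check. Here is a minimal, self-contained sketch against a hypothetical HTML snippet standing in for the real page (the snippet and variable names are my own, not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the gallery links on the page
html = '''
<a rel="prettyPhoto[gallery]" href="http://wordpandit.com/wp-content/uploads/2013/02/Obliterate.jpg">image</a>
<a rel="prettyPhoto[gallery]" href="/visual-vocabulary/">not an image</a>
<a href="http://example.com/photo.jpg">jpg, but no rel attribute</a>
'''

soup = BeautifulSoup(html, "html.parser")
# [href$=".jpg"] matches hrefs ending in .jpg; the rel value is quoted
# because it contains literal square brackets
jpg_links = [a["href"] for a in soup.select('a[rel="prettyPhoto[gallery]"][href$=".jpg"]')]
print(jpg_links)
```

Only the first link survives both conditions, so this should print a one-element list containing the Obliterate.jpg URL.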