Regex or another way to get only the image URLs

Asked: 2013-08-21 08:46:09

Tags: python regex web-scraping beautifulsoup

I want to download images from the page http://wordpandit.com/learning-bin/visual-vocabulary/page/2/. I fetch it with urllib and parse it with BeautifulSoup. The page contains many URLs, but I only want the ones that end in .jpg and that also carry the rel="prettyPhoto[gallery]" attribute. How can I do this with BeautifulSoup? An example of such a link is http://wordpandit.com/wp-content/uploads/2013/02/Obliterate.jpg

#http://wordpandit.com/learning-bin/visual-vocabulary/page/2/
import urllib
import BeautifulSoup
import lxml
baseurl='http://wordpandit.com/learning-bin/visual-vocabulary/page/'
count=2


for count in range(1,2):
    url=baseurl+count+'/'
    soup1=BeautifulSoup.BeautifulSoup(urllib2.urlopen(url))#read will not be needed
    #find all links to imgs
    atag=soup.findAll(rel="prettyPhoto[gallery]")
    for tag in atag:
        soup2=BeautifulSoup.BeautifulSoup(tag)
        imgurl=soup2.find(href).value
        urllib2.urlopen(imgurl)

1 Answer:

Answer 0 (score: 0)

Your code contains a lot of unnecessary pieces. Maybe you plan to use them later, but things like assigning count = 2 and then reusing count as the loop variable of a for range loop serve no purpose. The following code does what you want:

import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'

for count in range(1, 2):  # widen the range if you want more pages
    url = baseurl + str(count) + "/"
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    # Keep only anchors tagged rel="prettyPhoto[gallery]" that actually carry an href
    atag = soup.find_all(rel="prettyPhoto[gallery]", href=True)
    for tag in atag:
        if tag['href'].endswith(".jpg"):
            # urljoin handles both relative and absolute hrefs
            imgurl = urljoin(url, tag['href'])
            img = urllib2.urlopen(imgurl)
            # Save the image under its original file name
            with open(imgurl.split("/")[-1], "wb") as local_file:
                local_file.write(img.read())
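
If you also want to use a regular expression for the .jpg filter, as the title suggests, BeautifulSoup accepts compiled patterns as attribute filters. Below is a minimal Python 3 sketch of the same idea; it assumes the page structure is unchanged and swaps urllib2 for the third-party requests library:

import re
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'

for count in range(1, 3):  # pages 1 and 2
    url = baseurl + str(count) + '/'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # Anchors with rel="prettyPhoto[gallery]" whose href ends in .jpg
    for tag in soup.find_all('a', rel='prettyPhoto[gallery]', href=re.compile(r'\.jpg$')):
        imgurl = urljoin(url, tag['href'])  # works for relative and absolute hrefs
        with open(imgurl.split('/')[-1], 'wb') as local_file:
            local_file.write(requests.get(imgurl).content)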