Scraping Google Images with Python 3 (requests + BeautifulSoup)

Date: 2016-02-16 17:25:29

Tags: python html web-scraping google-image-search

I want to download images in bulk from Google Image Search.

My first approach, downloading the page source to a file and then opening that file with open(), works fine, but I would like to fetch the image URLs simply by running the script and changing the keyword.

First approach: go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982), view the page source in the browser, and save it as an HTML file. When I then open() that HTML file from the script, everything works as expected and I get a neat list of the URLs of all the images on the search page. That is what the commented-out open() line in the script below does (uncomment it to test).

However, if I instead use the requests.get() function to fetch the page, as in the line right below it, I get a different HTML document that does not contain the full image URLs, so I cannot extract them.
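
A quick way to see the discrepancy (a hypothetical check, not part of the script itself) is to test both sources for the marker the script relies on:

import requests

url = 'https://www.google.no/search?q=tower&tbm=isch'
live = requests.get(url).text
saved = open('tower.html', 'r').read()
print('imgurl' in live)   # False: the fetched page lacks the full image urls
print('imgurl' in saved)  # True: the browser-saved source contains them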

Please help me extract the correct image URLs.

Edit: here is a link to the tower.html I am using: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0

This is the code I have written so far:

import requests
from bs4 import BeautifulSoup

# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# top line uses the attached "tower.html" as the source, bottom line uses the url.
# The html file contains the source of the above url.
#page = open('tower.html', 'r').read()
page = requests.get(url).text

# parse the text as html
soup = BeautifulSoup(page, 'html.parser')

# iterate over all "a" elements
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    # keep only string links containing "imgurl" (the page has other links that are not interesting)
    if isinstance(link, str) and 'imgurl' in link:
        # print the part of the link between "=" and "&", which is the actual image url
        print(link.split('=')[1].split('&')[0])

2 answers:

Answer 0 (score: 0)

The first thing to know:

# http://www.google.com/robots.txt

User-agent: *
Disallow: /search
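
As a side note (my own suggestion, not part of the original answer): since /search is disallowed for crawlers, the sanctioned route for programmatic image search is Google's Custom Search JSON API. A minimal sketch, assuming you have created an API key and a custom search engine ID (both placeholders below are hypothetical):

import requests

resp = requests.get('https://www.googleapis.com/customsearch/v1', params={
    'key': 'YOUR_API_KEY',   # hypothetical placeholder
    'cx': 'YOUR_ENGINE_ID',  # hypothetical placeholder
    'q': 'tower',
    'searchType': 'image'})
for item in resp.json().get('items', []):
    print(item['link'])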


I will preface my answer by saying that Google relies heavily on scripts. You may well be getting different results because the page you request through requests does nothing with the scripts served on it, whereas loading the page in a web browser does.
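
If you really need the script-rendered page, one option (my own suggestion, not something the asker tried) is to drive a real browser with Selenium. A rough sketch, assuming selenium and a Firefox driver are installed:

from selenium import webdriver

url = 'https://www.google.no/search?q=tower&tbm=isch'  # the same kind of search url
driver = webdriver.Firefox()
driver.get(url)            # loads the page and executes its scripts
html = driver.page_source  # the markup after script execution
driver.quit()
print('imgurl' in html)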

Here's what I get when I request the URL you supplied.

The text I get back from requests.get(url).text does not contain 'imgurl' anywhere. Your script is looking for that as part of its criteria, and it simply is not there.

I do, however, see a bunch of <img> tags whose src attribute is set to an image URL. If that is what you are after, try this script:

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# page = open('tower.html', 'r').read()
page = requests.get(url).text

soup = BeautifulSoup(page, 'html.parser')

# every <img> tag's src attribute holds a (thumbnail) image url
for raw_img in soup.find_all('img'):
    link = raw_img.get('src')
    if link:
        print(link)

which returns the following results:

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s
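
If you want to save those thumbnails rather than just print them, a small extension of the loop above could look like this (the filename scheme is made up, and I assume the src values are plain http(s) urls as in the list above):

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&tbm=isch'
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')

# keep only http(s) thumbnails; some src values can be inline data uris
links = [img.get('src') for img in soup.find_all('img')
         if img.get('src', '').startswith('http')]
for n, link in enumerate(links):
    with open('tower_%d.jpg' % n, 'wb') as f:
        f.write(requests.get(link).content)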

Answer 1 (score: 0)

You can look up the image URL through either the "data-src" or the "src" attribute.


import os
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

REQUEST_HEADER = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

def get_images_new(prod_id):
    man_code = "apple"  # anything you want to search for
    url = "https://www.google.com.au/search?q=%s&source=lnms&tbm=isch" % man_code
    # send a browser-like User-Agent, otherwise Google serves a stripped-down page
    response = urlopen(Request(url, headers=REQUEST_HEADER))
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")
    # the class of Google's thumbnail <img> elements (subject to change)
    image_elements = soup.find_all("img", {"class": "rg_i Q4LuWd"})
    i = 1
    for img in image_elements:
        # the real thumbnail url sits in 'data-src'; 'src' is often a placeholder
        image = img.get('data-src')
        if image and i < 7:  # download the first six images only
            path = "/your/directory/" + str(prod_id)  # your target directory
            if not os.path.exists(path):
                os.mkdir(path)
            with open(path + "/" + str(i) + ".png", 'wb+') as imagefile:
                req = Request(image, headers=REQUEST_HEADER)
                imagefile.write(urlopen(req).read())
            i += 1
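
A hypothetical usage example, assuming the directory prefix in the function exists and is writable:

# downloads up to six "apple" thumbnails into /your/directory/42
get_images_new(42)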