I want to download bulk images using Google image search.

My first method, downloading the page source to a file and then opening it with open(), works fine, but I would like to be able to fetch the image urls just by running the script and changing the keyword.

First method: go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982). View the page source in the browser and save it as an html file. When I then open() that html file in the script, it works as expected and I get a neat list of the urls of all the images on the search page. This is the commented-out open() line in the script below (uncomment it to test).

However, if I use requests.get() to fetch the page, as in the line directly below it, I get back a different html document that does not contain the full image urls, so I cannot extract them.

Please help me extract the correct image urls.

Edit: here is a link to the tower.html I am using: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0

This is the code I have written so far:
import requests
from bs4 import BeautifulSoup

# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# the first line uses the attached "tower.html" as the source; the second uses the url directly.
# The html file contains the source of the above url.
#page = open('tower.html', 'r').read()
page = requests.get(url).text

# parse the text as html
soup = BeautifulSoup(page, 'html.parser')

# iterate over all "a" elements
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    # keep only string links that contain "imgurl" (the page has other links that are not interesting)
    if type(link) == str and 'imgurl' in link:
        # print the part of the link between "=" and "&", which is the actual url of the image
        print(link.split('=')[1].split('&')[0])
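
As an aside, a slightly more robust way to pull the imgurl parameter out of each href is to let the standard library parse the query string instead of splitting on "=" and "&". A minimal sketch (the href below is a hypothetical example of the format found in the saved tower.html):

from urllib.parse import urlparse, parse_qs

# hypothetical href in the format the saved page contains
href = '/imgres?imgurl=http://example.com/tower.jpg&imgrefurl=http://example.com/&h=900'
query = parse_qs(urlparse(href).query)
print(query['imgurl'][0])  # -> http://example.com/tower.jpg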
Answer 0 (score: 0)
The first thing to know:
# http://www.google.com/robots.txt
User-agent: *
Disallow: /search
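
You can check that rule programmatically with the standard library's robotparser; a minimal sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.google.com/robots.txt')
rp.read()
# False: /search is disallowed for generic crawlers
print(rp.can_fetch('*', 'http://www.google.com/search?q=tower&tbm=isch'))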
I'd like to preface my answer by saying that Google relies heavily on scripts. You may well be getting different results because the page you request via requests does nothing with the scripts it references, whereas loading the page in a web browser does run them.

Here's what I get when I request the url you supplied: the text returned by requests.get(url).text does not contain 'imgurl' anywhere. Your script is looking for that as part of its criteria, and it simply is not there.
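
You can confirm this for yourself in two lines (assuming a shortened form of the same search url):

import requests

url = 'https://www.google.no/search?q=tower&tbm=isch'  # shortened form of the search url
print('imgurl' in requests.get(url).text)  # prints False when the served page lacks the full urls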
However, I do see a bunch of <img> tags whose src attribute is set to an image url. If that is what you are after, try this script:
import requests
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# page = open('tower.html', 'r').read()
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')

# print the src of every img element that has one
for raw_img in soup.find_all('img'):
    link = raw_img.get('src')
    if link:
        print(link)
It returns the following results:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s
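
If you need the full-resolution urls that only appear once the page scripts have run, one option is to render the page in a real browser first. A sketch using Selenium (assumed setup: selenium installed and a matching chromedriver on PATH; this is not part of the original answer):

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&tbm=isch'
driver = webdriver.Chrome()   # a real browser, so the page scripts actually run
driver.get(url)
html = driver.page_source     # the DOM after script execution
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    if type(link) == str and 'imgurl' in link:
        print(link.split('=')[1].split('&')[0])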
Answer 1 (score: 0)
You can find the image urls using either the "data-src" or the "src" attribute.
import os
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

REQUEST_HEADER = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

def get_images_new(self, prod_id, name, header, **kw):
    i = 1
    man_code = "apple"  # anything you want to search for
    url = "https://www.google.com.au/search?q=%s&source=lnms&tbm=isch" % man_code
    # request the results page with a browser-like User-Agent
    response = urlopen(Request(url, headers=REQUEST_HEADER))
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")
    # "rg_i Q4LuWd" is the class Google puts on result thumbnails
    image_elements = soup.find_all("img", {"class": "rg_i Q4LuWd"})
    for img in image_elements:
        # the thumbnail url usually lives in "data-src"; fall back to "src"
        image = img.get('data-src') or img.get('src')
        if image and i < 7:  # save the first six images only
            filename = str(i)
            path = "/your/directory/" + str(prod_id)  # your target directory
            if not os.path.exists(path):
                os.mkdir(path)
            # download the image and write it to disk
            req = Request(image, headers=REQUEST_HEADER)
            resp = urlopen(req)
            with open(path + "/" + filename + ".png", 'wb+') as imagefile:
                imagefile.write(resp.read())
            i += 1
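
Whichever answer you use to collect the urls, downloading a thumbnail afterwards is straightforward; a minimal sketch with requests (the url is the first result from the list above):

import requests

# first thumbnail url from the results listed above
img_url = 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg'
resp = requests.get(img_url, timeout=10)
resp.raise_for_status()  # fail loudly on a bad response
with open('tower_thumb_1.png', 'wb') as f:
    f.write(resp.content)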