Python 3.x BeautifulSoup image URL scraping

Date: 2017-04-06 12:33:14

Tags: python-3.x beautifulsoup

I am trying to scrape image URLs with Python.

Checking a Google image search window with the developer tools shows roughly 100 image URLs,

and more URLs load as you scroll down. That's fine, though.

The problem is that I only get 20 URLs.

I saved the response of the request to an HTML file and opened it,

and confirmed that there are only 20 URLs in it.

I think the response only contains 20 image URLs, so only 20 are printed.

How can I get all of the image URLs?

Here is the source code.

#-*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

if __name__ == "__main__":
    print("Crawling!!!!!!!!!!!!!!!")

    hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0)', 
           'referer' : 'http:google.com',
           'Accept': 'text/html',
           'Accept':'application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept': 'none',
           'Connection': 'keep-alive'}

    inputSearch = "sites:pinterest+white+jeans"
    req = urllib.request.Request("https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q=" + inputSearch, headers = hdr)
    data = urllib.request.urlopen(req).read()

    bs = BeautifulSoup(data, "html.parser")

    for img in bs.find_all('img'):
        print(img.get('src'))

1 Answer:

Answer 0 (score: 0)

Your link is wrong. You can use the code below and see whether it does what you need.

You just pass in a searchTerm, and the program will open the Google page and fetch the URLs of 20 images.

Code:

def get_images_links(searchTerm):

    import requests
    from bs4 import BeautifulSoup

    # Note the "?" before the query string -- this is what your URL was missing.
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)
    d = requests.get(searchUrl).text
    soup = BeautifulSoup(d, 'html.parser')

    img_tags = soup.find_all('img')

    imgs_urls = []
    for img in img_tags:
        # Skip tags without a src attribute and inline/base64 images.
        if img.get('src', '').startswith("http"):
            imgs_urls.append(img['src'])

    return imgs_urls

Usage:

get_images_links('computer')

Output:

['https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSeq5kKIsOg6zSM2bSrWEnYhpZEpmOYiiLzqf6qfwKzSVUoZ5rHoya75DM',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTBUesIhyt4CgASIUDruqvvMzUBFCuG_iV92NXjZPMtPE5v2G626bge0g0',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRYz8c6LUAiyuAsXkMrOH8DC56aFEMy63m8Fw8-ZdutB5EDpw1hl0y3xg',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT33QNycX0Ghqhfqs7Masrk9uvp6d66VlD2djHFfqL4P6phZCJLxkSx0wnt',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRUF11cLRzH2WNfiUJ3WeAOm7Veme0_GLfwoOCs3R5GTQDfcFHMgsNQlyo',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTxcTcv4NPTboVorbD4I-uJbYjY4KjAR5JaMvUXCg33CLDUqop8IufKNw',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTU8MkWwhDgcobqn_H2N3SS7dPVwu3I-ki1Sa_4u5YOEt-rAfOk1Kb2jpHO',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQlGu_Y_dhu60UNyilmIUSuOjX5_UnmcWr2AXGJ0w6BmvCXUZissCrtPcw',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQN7ItGvBHD1H9EMBC0ZFDMzNu5nt2L-EK1CKmQE4gRNtylalyTTJQxalY',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQyFgwD4Wr20OImzk9Uc0gGGI2-7mYQAU6mJn2GEFkpgLTAqUQUm4KL0TUQwQ',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQR0LFRaUGIadOO5_qolg9ZWegXW0OTghzBf1YzoIhpqkaiY1f3YNx4JnE',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRuOk4nPPPaUdjnZl1pEwGwlfq25GjvZFsshmouB0QaV925KxHg43wJFWc6',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR5aqLfB9SaFBALzp4Z2qToLeWqeUjqaS3EwNhi6faHRCxYCPMsjhmivKf8',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR6deLi7H9DCaxJXJyP7lmoixad5Rgo1gBLfVQ35lEWrvpgPoyQJ8CcZ-4',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSPQAfl2WB-AwziLan6NAzvzh2xVDu_XJEcjqSGOdnOJdffo7goWhrFd3wU',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSB3o5cP8DMk9GqT9wpB1N7q6JtREUwitghlXO65UD5s3xCoLj80QuDlzw',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ18lWMvzZcIZvKI36BUUpnBIaa5e4A3TUAVdxAs6hhJ-rod446dMrPph2V',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8XZhvomXcafQehhetM1_ZXOufBvWmEDAbOsqX-fiU5Xu3U6uWAO3XW-M',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQiWudrcl9y0XbtC19abcPfSwO4N060ipv4znqxnpLYWX5UFO-QdzJatd0r',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQtgqDxef3AOsiyUk0J0MbXgZT8c0JsAW3UpoumSTMFSGXde3BETrGSqw']
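
Note that all of these are gstatic thumbnail URLs rather than the original images. At the time of writing, Google embedded the full-size URL as a JSON blob inside <div class="rg_meta"> tags, under the "ou" key; this is an assumption about Google's internal markup, so it can break whenever the page changes. A minimal sketch that could run on the same soup object:

import json

def find_full_size_imgs(soup):
    # Each result is assumed to carry a <div class="rg_meta"> whose text
    # is a JSON blob; its "ou" field holds the original image URL.
    urls = []
    for meta in soup.find_all('div', class_='rg_meta'):
        try:
            urls.append(json.loads(meta.text)['ou'])
        except (ValueError, KeyError):
            pass
    return urls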

Edit

If you want to get more than 20 URLs, you will have to find a way to send the AJAX requests that fetch the rest of the page, or you can use selenium to simulate your interaction with the web page.
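
For the first approach, here is a minimal sketch. It assumes Google's image search accepts an ijn parameter as a zero-based batch index for further result pages; that is an undocumented detail of Google's pagination and may stop working at any time:

def get_images_links_page(searchTerm, page):

    import requests
    from bs4 import BeautifulSoup

    # "ijn" is assumed to select the batch of results (0 = first batch, etc.).
    searchUrl = "https://www.google.com/search?q={}&tbm=isch&ijn={}".format(searchTerm, page)
    d = requests.get(searchUrl).text
    soup = BeautifulSoup(d, 'html.parser')

    return [img['src'] for img in soup.find_all('img')
            if img.get('src', '').startswith('http')]

Collecting several batches is then a simple loop:

all_urls = []
for page in range(5):
    all_urls.extend(get_images_links_page('computer', page))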

I went with the second approach (there are probably plenty of other ways to do this, and you can optimize this code a lot if you want):

Code 2:

def scrape_all_imgs_google(searchTerm):

    from selenium import webdriver
    from bs4 import BeautifulSoup
    from time import sleep

    def scroll_page():
        # Scroll to the bottom repeatedly so Google loads more batches of images.
        for i in range(7):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)

    def click_button():
        # The "Show more results" button at the bottom of the results page.
        more_imgs_button_xpath = '//*[@id="smb"]'
        driver.find_element_by_xpath(more_imgs_button_xpath).click()

    def create_soup():
        html_source = driver.page_source
        return BeautifulSoup(html_source, 'html.parser')

    def find_imgs(soup):
        imgs_urls = []
        for img in soup.find_all('img'):
            # Skip tags without a src attribute and inline/base64 images.
            if img.get('src', '').startswith('http'):
                imgs_urls.append(img['src'])
        return imgs_urls

    #create webdriver
    driver = webdriver.Chrome()

    #define url using search term
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)

    #get url
    driver.get(searchUrl)

    try:
        click_button()
        scroll_page()
    except Exception:
        #the button may only appear after some scrolling, so retry in the other order
        scroll_page()
        click_button()

    #create soup only after we loaded all imgs by scrolling the page down
    soup = create_soup()

    #find imgs in soup
    imgs_urls = find_imgs(soup)

    #close driver
    driver.close()

    #return list of all img urls found in page
    return imgs_urls

Usage:

urls = scrape_all_imgs_google('computer')

print(len(urls))
print(urls)

Output:

377
['https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcT5Hi9cdE5JPyGl6G3oYfre7uHEie6zM-8q3zQOek0VLqQucGZCwwKGgfoE', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR0tu_xIYB__PVvdH0HKvPd5n1K-0GVbm5PDr1Br9XTyJxC4ORU5e8BVIiF', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQqHh6ZR6k-7izTfCLFK09Md19xJZAaHbBafCej6S30pkmTOfTFkhhs-Ksn', and etc...
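
As one possible optimization, the fixed sleep(3) calls can be replaced with Selenium's explicit waits, which block only until the element is actually ready. A minimal sketch, reusing the same '//*[@id="smb"]' XPath as above (itself an assumption about Google's markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=computer&tbm=isch")

# Wait up to 10 seconds for the "show more results" button to become
# clickable instead of sleeping for a fixed amount of time.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="smb"]')))
button.click()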

If you don't want to use this code, you can take a look at Google Scraper and see whether it has any methods that work for you.