Image scraper: urllib2.URLError: <urlopen error no host given>

Time: 2016-07-10 22:06:16

Tags: image python-2.7 urllib2

I understand that this error happens because it has no URL to request, but I cannot figure out why.

My code is a 4chan image scraper, and it works on every board except /wg/ (the Wallpapers/General board); the other boards give no problems. For some reason, on this board only, it will not move on to the next page to scrape images, and instead it gives me the error "urllib2.URLError: <urlopen error no host given>".

Any help is greatly appreciated. I don't know why this error only happens on /wg/. I thought it might have something to do with file size, but that would not explain this error.
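
For reference, here is a minimal snippet, separate from my scraper and assuming Python 2.7, showing when urllib2 raises this error: it happens whenever urlopen is handed a URL that has a scheme but no host part ("http:/wg/2" below is just an illustrative bad URL, not one my script actually builds).

import urllib2

try:
    # Scheme is present but the host part is missing, so urlopen cannot connect anywhere
    urllib2.urlopen("http:/wg/2")
except urllib2.URLError as e:
    print e  # prints: <urlopen error no host given>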

Here is my code (below), and here is a link to it on my GitHub: https://github.com/devinatoms/4chanScraper/blob/master/4chanScrape.py

##@author klorox


from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import collections

print"""

                )           )           (      *                   (         (                      (     
     )   (   ( /(  (     ( /(           )\ ) (  `   (              )\ )  (   )\ )   (               )\ )  
  ( /(   )\  )\()) )\    )\())         (()/( )\))(  )\ )          (()/(  )\ (()/(   )\          (  (()/(  
  )\())(((_)((_)((((_)( ((_)\           /(_)((_)()\(()/(           /(_)(((_) /(_)((((_)(  `  )  )\  /(_)) 
 ((_)\ )\___ _((_)\ _ )\ _((_)         (_)) (_()((_)/(_))_        (_)) )\___(_))  )\ _ )\ /(/( ((_)(_))   
| | (_((/ __| || (_)_\(_| \| |         |_ _||  \/  (_)) __|       / __((/ __| _ \ (_)_\(_((_)_\| __| _ \  
|_  _| | (__| __ |/ _ \ | .` |          | | | |\/| | | (_ |       \__ \| (__|   /  / _ \ | '_ \| _||   /  
  |_|   \___|_||_/_/ \_\|_|\_|         |___||_|  |_|  \___|       |___/ \___|_|_\ /_/ \_\| .__/|___|_|_\  
                                                                                         |_|              
                    written by klorox, some by Icewave                                                                                                                                                                                                                           
                                                                                         """



# Gather our HTML source code from the pages
def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'lxml')

# Main logic function, we use this to re-iterate through the pages
def main(url):
    image_name = "image"
    print url
    header = {'User-Agent': 'Mozilla/5.0'} 
    r = requests.get(url, headers=header)
    html_content = r.text
    soup = BeautifulSoup(html_content, 'lxml')
    anchors = soup.findAll('a')
    links = [a['href'] for a in anchors if a.has_attr('href')]

# Grabs all the a anchors from the HTML source which contain our image links
    def get_anchors(links):
        for a in anchors:
            if a.has_attr('href'):
                links.append(a['href'])
        return links

# Gather the raw links and sort them        
    raw_links = get_anchors(links)
    raw_links.sort()

# Return only the links that occur more than once; since every href was appended twice above, this yields each unique link exactly once
    def get_duplicates(arr):
        dup_arr = arr[:]
        for i in set(arr):
            dup_arr.remove(i)       
        return list(set(dup_arr))   

# Define our list of new links and call the function to parse out duplicates
    new_elements = get_duplicates(raw_links)

# Get the image links from the raw links, make a request, then write them to a folder.
    def get_img():      
        for element in new_elements:
            if ".jpg" in str(element) or '.png' in str(element) or '.gif' in str(element):
                retries = 0
                passed = False
                while(retries < 3): 
                    try:
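                        # Links scraped from the page can be scheme-relative ("//host/path");
                        # without a scheme urlopen raises "no host given", so prepend "http:" first.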
                        if "https:" not in element and "http:" not in element:
                            element = "http:"+element           
                        raw_img = urllib2.urlopen(element).read()
                        cntr = len([i for i in os.listdir(dirr) if image_name in i]) + 1
                        print("Saving img: " + str(cntr) + "  :      " + str(element) + " to: "+ dirr )
                        with open(dirr + image_name + "_"+ str(cntr)+".jpg", 'wb') as f:
                            f.write(raw_img)
                        passed = True
                        break
                    except urllib2.URLError, e:
                        retries += 1
                        print "Failed on", element, "(Retrying", retries, ")"
                if not passed:
                    print "Failed on ", element, "skipping..."

# Call our image writing function           
    get_img()

# Ask the user which board they would like to use
print """Boards: [a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vr / w / wg] [i / ic] [r9k] [s4s] [cm / hm / lgbt / y] [3 / aco / adv / an / asp / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / qst / sci / soc / sp / tg / toy / trv / tv / vp / wsg / wsr / x]""" 
print "\n"
board = raw_input("Enter the board letter (Example: b, p, w): ")
dirr = raw_input("Enter the working directory (USE DOUBLE SLASHES) (Example: C:\\\\Users\\\\Username\\\\Desktop\\\\Folder\\\\): ")
# Define our starting page number and first try value           
page = 2
firstTry = True

# Check if this is the first iteration
if firstTry == True:
    url = "http://boards.4chan.org/"+board+"/"
    firstTry = False
    main(url)
    # After first iteration, this loop changes the url after each completed page by calling our main function again each time.
    while page <= 10 and page >= 2 and firstTry == False:
        url = "http://boards.4chan.org/"+board+"/"+ str(page) +"/"
        page = page + 1
        p = page - 1
        print("Page: " + str(p))
        main(url)

1 Answer:

Answer 0 (score: 1)

So never mind, I sorted it out myself.

The solution was to wrap the request in a try/except and to check whether the link already contains http or https, then rewrite the URL accordingly. The error was probably caused by the server's protection against mass requests (just a guess).
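
In case it helps anyone else, here is a rough sketch of that fix, assuming Python 2.7; normalize_link and fetch_image are just illustrative names, not functions from the script above.

import urllib2

def normalize_link(link):
    # Links scraped from the page can be scheme-relative ("//host/path");
    # give them a scheme so urlopen does not fail with "no host given".
    if "http:" not in link and "https:" not in link:
        link = "http:" + link
    return link

def fetch_image(link, max_retries=3):
    # Wrap the request in try/except and retry a few times before giving up.
    for attempt in range(1, max_retries + 1):
        try:
            return urllib2.urlopen(normalize_link(link)).read()
        except urllib2.URLError as e:
            print "Failed on", link, "(", e, ") retry", attempt, "of", max_retries
    return None

The two parts that matter are giving scheme-relative links a scheme before calling urlopen, and retrying inside the try/except so one bad link does not stop the whole page.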