URLOpen error when combining a URL with words from a wordlist

Asked: 2015-02-24 10:16:04

Tags: python urllib host

Hey guys, I'm building a Python web crawler. I have a URL that ends with "search?q=", and I append to it words from a wordlist I loaded into a list beforehand. But when I try to open the result with urllib2.urlopen(url), it throws an error (urlopen error no host given). Yet when I open the same link with urllib2 normally (typing in by hand the word that would otherwise be appended automatically), it works fine. Can you tell me why this happens?

Thanks and regards

Full error:



  File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 61, in <module>
    getResults()
  File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 40, in getResults
    usock = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 402, in open
    req = meth(req)
  File "C:\Python27\lib\urllib2.py", line 1113, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>

Code:

import urllib2

with open(filePath, "r") as ins:
    wordList = []
    for line in ins:
        wordList.append(line)

def getResults():
    packageID = ""
    count = 0
    word = "Test"
    for x in wordList:
        word = x
        print word
        url = 'http://www.example.com/search?q=' + word
        usock = urllib2.urlopen(url)
        page_source = usock.read()
        usock.close()
        print page_source
        startSequence = "data-docid=\""
        endSequence = "\""
        while page_source.find(startSequence) != -1:
            start = page_source.find(startSequence) + len(startSequence)
            end = page_source.find(endSequence, start)
            print str(start)
            print str(end)
            link = page_source[start:end]
            print link
            if link:
                if not link in packageID:
                    packageID += link + "\r\n"
                    print packageID
            page_source = page_source[end + len(endSequence):]
        count += 1

So when I print the string word, it outputs the correct word from the wordlist.
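One thing worth checking (a hypothesis on my part, not confirmed in the thread): lines read from a file keep their trailing newline, and words may contain characters that are not URL-safe, either of which can produce a malformed URL. Stripping and percent-encoding each word before building the URL avoids both problems:

```python
# Hypothesis (not confirmed in the thread): each line read from the wordlist
# file keeps its trailing "\n", and unstripped/unquoted characters can make
# the URL malformed. Stripping and percent-encoding each word avoids both.
try:
    from urllib import quote        # Python 2, as used in the question
except ImportError:
    from urllib.parse import quote  # Python 3 location of the same function

words = ["Test\n", "two words\n"]  # sample lines as they come out of a file
urls = ['http://www.example.com/search?q=' + quote(w.strip()) for w in words]
print(urls[1])  # http://www.example.com/search?q=two%20words
```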

2 answers:

Answer 0 (score: 0)

I solved the problem. I simply use urllib now instead of urllib2, and everything works fine. Thanks everyone :)
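A minimal sketch of that switch (assumed, since the thread does not show the final code): in Python 2, urllib.urlopen is swapped in for urllib2.urlopen with the same call shape.

```python
# Sketch of the switch described above (Python 2 name; shown with a Python 3
# fallback only so the snippet imports everywhere). The example.com URL is the
# placeholder from the question.
try:
    from urllib import urlopen          # Python 2: what answer 0 switched to
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent, for illustration

url = 'http://www.example.com/search?q=' + 'Test'
# usock = urlopen(url); page_source = usock.read(); usock.close()
print(callable(urlopen))  # True
```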

Answer 1 (score: -1)

Note that urlopen() returns a response, not a request.

Your proxy configuration may be broken; verify that your proxies actually work:

print urllib.getproxies()

Or bypass proxy support entirely:

response = urllib.urlopen(
    "http://www.example.com/search?q=" + text_to_check,
    proxies={})
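Another way to bypass proxies (my addition, not part of the answer): build an opener with an empty ProxyHandler. The same names exist in both urllib2 (Python 2) and urllib.request (Python 3):

```python
# Build an opener that never uses a proxy: an empty ProxyHandler map
# overrides any proxy settings picked up from the environment.
try:
    import urllib2 as urlreq          # Python 2, as used in the question
except ImportError:
    import urllib.request as urlreq   # Python 3 location of the same classes

no_proxy_opener = urlreq.build_opener(urlreq.ProxyHandler({}))
# response = no_proxy_opener.open('http://www.example.com/search?q=Test')
print(type(no_proxy_opener).__name__)  # OpenerDirector
```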

An example approach for combining a URL with words from a wordlist. It joins a list word onto a base URL to fetch images from the resulting page and download them. Loop over it to cover your whole list.

import urllib
import re

print "The URL crawler starts.."

mylist = ["http://www.ebay", "https://www.npmjs.org/"]
wordlist = [".com", "asss"]

x = 1
# Combine the first base URL with the first wordlist entry and fetch the page.
urlcontent = urllib.urlopen(mylist[0] + wordlist[0]).read()
imgUrls = re.findall('img .*?src="(.*?)"', urlcontent)

# Download every image found on the page, numbering the files 1.jpg, 2.jpg, ...
for imgUrl in imgUrls:
    print imgUrl
    urllib.urlretrieve(imgUrl, str(x) + ".jpg")
    x = x + 1

Hope this helps; otherwise please post your code and error log.