DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname

Asked: 2013-07-29 06:27:46

Tags: python dns scrapy linkedin web-crawler

This question is an extension of a question already answered here, viz. Crawling LinkedIn while authenticated with Scrapy (@Gates).

I kept the base of that script the same, only adding my own session_key and session_password, and changed the start URLs for my use case, as below:

from scrapy.contrib.spiders.init import InitSpider  # import path in the Scrapy 0.14 era

class LinkedPySpider(InitSpider):
    name = 'Linkedin'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/nhome/"]

# Also tried with this start URL:
start_urls = ["http://www.linkedin.com/profile/view?id=38210724&trk=nav_responsive_tab_profile"]

I also tried changing start_urls to the second URL (above) to see whether I could start crawling from my own profile page, but could not.
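For reference, here is a minimal, self-contained sketch of the InitSpider login flow from the linked answer that this script is based on (the placeholder credentials and the 'Sign Out' check are illustrative assumptions drawn from that answer, not verified code):

from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders.init import InitSpider

class LinkedPySpider(InitSpider):
    name = 'Linkedin'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/nhome/"]

    def init_request(self):
        # Fetch the login page before touching any of the start_urls.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit the login form with your own credentials.
        return FormRequest.from_response(
            response,
            formdata={'session_key': 'user@example.com',
                      'session_password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # 'Sign Out' only appears on pages served to a logged-in session.
        if 'Sign Out' in response.body:
            self.log('Successfully logged in; starting crawl.')
            return self.initialized()
        self.log('Login failed.')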

**Error that I get:**
scrapy crawl Linkedin
**2013-07-29 11:37:10+0530 [Linkedin] DEBUG: Retrying <GET http://www.linkedin.com/nhome/> (failed 1 times): DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname.**


**To see if the name was resolving, I tried:**
nslookup www.linkedin.com #works
nslookup www.linkedin.com/uas/login # fails; I think paths below a site's hostname don't resolve via DNS, and that's normal, right?
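(That second failure is expected: DNS resolves hostnames only, never URL paths. A quick standard-library check, as a sketch, makes the distinction explicit:)

import socket

# Resolving a bare hostname works:
print(socket.gethostbyname('www.linkedin.com'))

# A hostname plus a path is not valid DNS input and fails with
# socket.gaierror, just like nslookup on the full URL:
try:
    socket.gethostbyname('www.linkedin.com/uas/login')
except socket.gaierror as e:
    print('lookup failed: %s' % e)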

Then I also tried to see whether the error could be due to the name server not resolving, and appended nameservers as below.
echo $http_proxy #gives http://username:password@your.proxy.com:80
sudo vi /etc/resolv.conf
and appended the IP addresses of free, fast DNS nameservers to this file:
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 202.51.5.52

I'm not well versed in NS conflicts and DNS lookup failures, but could this be because I'm in a virtual machine? Though my other scraping projects seem to work fine.

My basic use case is to extract the list of connections and the companies they work at, along with a bunch of other attributes. So I want to scrape/paginate through "Connections" (all) from the main profile page, which is not shown if I use a public profile as the start URL, i.e. scrapy shell http://www.linkedin.com/in/ektagrover. Passing a legitimate XPath through hxs.select does seem to work there, but not when I use it with the spider, and it doesn't satisfy my base use case (above).

Question: Is something wrong with my start_url, or is my assumption wrong that "post-authentication, the start page could be ANY page on that site, since authentication redirects through https://www.linkedin.com/uas/login"?

Working environment: Oracle VM VirtualBox running Ubuntu 12.04 LTS, Python 2.7.3, Scrapy 0.14.4

What worked / answer: It turned out my proxy server was pointing to the wrong place: echo $http_proxy gave http://username:password@your.proxy.com:80. [Unset the environment variable $http_proxy] I just did http_proxy= to unset the proxy, then echo $http_proxy returned nothing to confirm. After that, scrapy crawl Linkedin got through the authentication module. I'm still stuck here and there, but that's another question. Thanks @warwaruk.
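For anyone hitting the same thing, the proxy variables can also be cleared from Python before launching the crawl (a sketch; shell unset http_proxy achieves the same):

import os

# Scrapy's HttpProxyMiddleware picks up http_proxy/https_proxy from the
# environment; remove them so requests go out directly.
for var in ('http_proxy', 'https_proxy', 'HTTP_PROXY', 'HTTPS_PROXY'):
    os.environ.pop(var, None)

print(os.environ.get('http_proxy'))  # prints None to confirm it is unset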

1 Answer:

Answer (score: 1):

**Error that I get:**
scrapy crawl Linkedin
**2013-07-29 11:37:10+0530 [Linkedin] DEBUG: Retrying <GET http://www.linkedin.com/nhome/> (failed 1 times): DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname.**


**To see if the name was resolving, I tried:**
nslookup www.linkedin.com #works
nslookup www.linkedin.com/uas/login # fails; I think paths below a site's hostname don't resolve via DNS, and that's normal, right?

Then I also tried to see whether the error could be due to the name server not resolving, and appended nameservers as below.
echo $http_proxy #gives http://username:password@your.proxy.com:80

You have a proxy set: http://username:password@your.proxy.com:80

Obviously, it does not exist on the Internet:

$ nslookup your.proxy.com
Server:         127.0.1.1
Address:        127.0.1.1#53

** server can't find your.proxy.com: NXDOMAIN

Unset the $http_proxy environment variable, or set up a working proxy and change the environment variable accordingly.
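If a working proxy is needed instead, Scrapy's HttpProxyMiddleware also reads the proxy from request.meta['proxy'] (supported in modern Scrapy; worth verifying on 0.14), so a per-request sketch could look like this (the proxy URL is a placeholder for a host that actually resolves):

from scrapy.http import Request

# Inside the spider: route requests through a specific, reachable proxy,
# overriding whatever the environment variables say.
def start_requests(self):
    for url in self.start_urls:
        yield Request(url,
                      meta={'proxy': 'http://user:pass@real-proxy.example.com:80'})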