Python Link Puller

Time: 2013-01-23 22:01:50

Tags: python html hyperlink beautifulsoup

So I've been successfully using this Python script:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://conceled:conceled@traveler.pha.phila.gov:8443/servlet/traveler')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']

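to pull the links off a website. It works on almost any other site, but when I try it on the one above (the one I actually need it to work on), I get these errors: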

Traceback (most recent call last):
  File "C:\Users\joe\Desktop\PHA\AndroidPhones\androidphonescript2.py", line 5, in <module>
    status, response = http.request('https://conceled@traveler.pha.phila.gov:8443/servlet/traveler')
  File "C:\Python27\lib\httplib2.py", line 608, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cacheFullPath)
  File "C:\Python27\lib\httplib2.py", line 449, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "C:\Python27\lib\httplib2.py", line 427, in _conn_request
    conn.connect()
  File "C:\Python27\lib\httplib.py", line 1157, in connect
    self.timeout, self.source_address)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11003] getaddrinfo failed

1 Answer:

Answer 0 (score: 1)

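The site's certificate is invalid, but that doesn't seem to be what is causing the problem. What version of httplib2 are you using? I just installed the current version, 0.7.7, and I get a better exception message: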

  

  File "d:\Python27\lib\site-packages\httplib2-0.7.7-py2.7.egg\httplib2\__init__.py", line 1287, in _conn_request
    raise ServerNotFoundError("Unable to find the server at %s" % conn.host)
ServerNotFoundError: Unable to find the server at conceled:conceled@traveler.pha.phila.gov

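So it is not parsing //username:password@ as a username and password; the whole string is treated as the host name, which is why the lookup fails. The httplib2 documentation says credentials should be supplied with: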

Http.add_credentials(name, password[, domain=None])

So try:

http = httplib2.Http()
http.add_credentials(name, password)
status, response = http.request('https://traveler.pha.phila.gov:8443/servlet/traveler')

I don't have an account on the site, so I can't test this.

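If you need to support a username and password embedded in the URL, you will have to write code to parse them out yourself. That shouldn't be too hard with a regular expression (Python's re module).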
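As a rough sketch of that last suggestion, something like the following could pull the credentials out of the URL with the re module. The helper name split_credentials and the example.com URL are made up for illustration and are not part of httplib2; the pattern is deliberately simple and assumes the username and password contain no unescaped ':', '@' or '/' characters.

```python
import re

# Hypothetical helper, not part of httplib2.
# Matches scheme://user[:password]@rest-of-url
_CRED_RE = re.compile(
    r'^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*://)'  # e.g. "https://"
    r'(?P<user>[^:@/]*)'                        # username
    r'(?::(?P<password>[^@/]*))?'               # optional ":password"
    r'@(?P<rest>.*)$'                           # host, port, path, ...
)


def split_credentials(url):
    """Return (url_without_credentials, user, password).

    If the URL carries no "user:password@" part, it is returned
    unchanged and user and password are both None.
    """
    match = _CRED_RE.match(url)
    if match is None:
        return url, None, None
    clean_url = match.group('scheme') + match.group('rest')
    return clean_url, match.group('user'), match.group('password')
```

The cleaned URL would then go to http.request(), and the username and password to http.add_credentials(), as in the snippet above. Since I can't log in to the site, treat this as untested against the real server.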