BadStatusLine: Python with http.client - works for some websites but not others

Time: 2013-08-16 12:47:25

Tags: python html httpconnection

import http.client
import csv

def http_get(url, path, headers):
    try:
        conn = http.client.HTTPConnection(url)
        print ('Connecting to ' + url)
        conn.request(url, path, headers=headers)
        resp = conn.getresponse()
        if resp.status<=400:
            body = resp.read()
            print ('Reading Source...')
    except Exception as e:
        raise Exception('Connection Error: %s' % e)
        pass
    finally:
        conn.close()
        print ('Connection Closed')

    if resp.status >= 400:
        print (url)
        raise ValueError('Response Error: %s, %s, URL: %s' % (resp.status, resp.reason,url))
    return body


with open('domains.csv','r') as csvfile:
    urls = [row[0] for row in csv.reader(csvfile)]

L = ['Version 0.7','Version 1.2','Version 1.5','Version 2.0','Version 2.1','Version 2.3','Version 2.5','Version 2.6','Version 2.7','Version 2.8','Version 2.9','Version 2.9','Version 3.0','Version 3.1','Version 3.2','Version 3.3','Version 3.4','Version 3.5.1','Version 3.5.2']
PATH = '/'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
HEADERS = {'User-Agent': user_agent}

for url in urls:        
    HOST = url

    print ('Testing WordPress Installation on ' + url)
    http_get(HOST,PATH,HEADERS)

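I've been looking at this for a week or two now and have found similar errors, but I don't understand why it works for some of the websites in the csv file and not for others. I checked the server and saw that it was dropping ICMP packets by default, so I changed that, and now traceroute and ping both show 100% received instead of the previous 100% loss. I assumed that was related, because all the sites on that host have the same problem. But my script still throws an exception: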

mud@alex-BBVM:~/Desktop/scripts$ python3 httpTest.py
Testing WordPress Installation on XXXXX.ie
Connecting to exsite.ie
Reading Source...
Connection Closed
Testing WordPress Installation on AAAAAA.com
Connecting to AAAAA.com
Reading Source...
Connection Closed
Testing WordPress Installation on YYYYY.ie
Connecting to YYYYY.ie
Reading Source...
Connection Closed
Testing WordPress Installation on CCCCC.ie
Connecting to CCCCCC.ie
Reading Source...
Connection Closed
Testing WordPress Installation on DDDDDDD.ie
Connecting to DDDDDDD.ie
Connection Closed
Traceback (most recent call last):
  File "httpTest.py", line 9, in http_get
    resp = conn.getresponse()
  File "/usr/lib/python3.2/http/client.py", line 1049, in getresponse
    response.begin()
  File "/usr/lib/python3.2/http/client.py", line 346, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.2/http/client.py", line 328, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: <html>


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "httpTest.py", line 38, in <module>
    http_get(HOST,PATH,HEADERS)
  File "httpTest.py", line 14, in http_get
    raise Exception('Connection Error: %s' % e)
Exception: Connection Error: <html>

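I've obviously replaced the URLs with placeholders, since they are client addresses that I would rather not post here.

Anyway, any insight or help is appreciated.

I have read the documentation for http.client, and this is the relevant exception, but I can't seem to extract a solution from what I've read.

Thanks!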

1 Answer:

Answer 0: (score: 0)

First, I recommend that you always read everything from the HTTPResponse object before calling conn.close(). Even a 404 response contains a document.
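A minimal sketch of that advice, assuming a plain HTTP GET is what the script intends. The function name and the optional conn parameter (added only so the function can be exercised without a network) are illustrative assumptions, not part of the original script:

```python
import http.client


def http_get(host, path, headers, conn=None):
    """Fetch path from host and return (status, reason, body)."""
    # conn is a hypothetical injection point; normally a new connection is made.
    if conn is None:
        conn = http.client.HTTPConnection(host, timeout=10)
    try:
        # The first argument of request() is the HTTP method, then the path.
        conn.request('GET', path, headers=headers)
        resp = conn.getresponse()
        # Read the whole body while the connection is still open,
        # for error statuses as well as successful ones.
        body = resp.read()
        return resp.status, resp.reason, body
    finally:
        conn.close()
```

The caller can then decide what to do with a status of 400 or above, with the body already in hand, for example: status, reason, body = http_get(HOST, PATH, HEADERS).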

I'm confused by your traceback; as far as I know, http.client.BadStatusLine should be hidden by except Exception.
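On the traceback itself: in Python 3, raising a new exception inside an except block implicitly chains the original exception onto the new one, which is likely why the output shows both, joined by "During handling of the above exception, another exception occurred". A small sketch that reuses the message format from the question; the function name is made up for illustration:

```python
import http.client


def wrap_like_the_question():
    try:
        # Stand-in for conn.getresponse() failing on a malformed status line.
        raise http.client.BadStatusLine('<html>')
    except Exception as e:
        # Raising here keeps the original exception as __context__.
        raise Exception('Connection Error: %s' % e)


try:
    wrap_like_the_question()
except Exception as outer:
    message = str(outer)
    original = outer.__context__

print(message)                  # Connection Error: <html>
print(type(original).__name__)  # BadStatusLine
```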

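In general, an except Exception clause is not a good idea unless you re-raise the same exception (which you are not doing), because you may mask the underlying problem. Either way, it should be the first thing to go when code doesn't work as expected.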
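One way to follow this, sketched under the assumption that only HTTP protocol errors and socket-level errors should be treated as connection problems. The helper name and the callable-based shape are hypothetical; the bare raise re-raises the same exception object, so nothing is masked:

```python
import http.client


def run_reporting_connection_errors(action):
    """Call action(); report HTTP/socket failures, then re-raise them unchanged."""
    try:
        return action()
    except (http.client.HTTPException, OSError) as exc:
        # BadStatusLine is a subclass of HTTPException, so it lands here.
        print('Connection Error: %r' % exc)
        raise  # same exception, original traceback preserved
```

Anything else, such as a NameError or TypeError from a bug in the script, passes straight through instead of being relabelled as a connection error.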

Also, the output you provided doesn't seem to match the code you provided.

Specifically, according to the traceback:

Connection Closed
Traceback (most recent call last):
  File "httpTest.py", line 9, in http_get
    resp = conn.getresponse()

That code is preceded by a print ('Connecting to ' + url):

print ('Connecting to ' + url)
conn.request(url, path, headers=headers)
resp = conn.getresponse()

But the line before the traceback in the output is Connection Closed.
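That ordering is consistent with how try / except / finally behaves: the finally block runs before the exception leaves the function, and the traceback is printed only once the exception reaches the top level, so a print in finally shows up ahead of it. A small sketch with made-up names that records the order of events:

```python
events = []


def run():
    try:
        events.append('try')
        raise ValueError('boom')
    except ValueError as e:
        events.append('except')
        raise Exception('wrapped: %s' % e)
    finally:
        # Runs before the new exception propagates to the caller.
        events.append('finally')


try:
    run()
except Exception as exc:
    events.append('caller sees: %s' % exc)

print(events)
# ['try', 'except', 'finally', 'caller sees: wrapped: boom']
```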


Update

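Ignoring the confusing execution order of try / finally:

http.client.BadStatusLine is raised when the initial response line does not look like HTTP/1.1 200 OK. In this particular case, it was <html>.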
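This can be reproduced without a network by feeding canned bytes to http.client.HTTPResponse. The FakeSocket class and parse function below are hypothetical helpers for illustration; they rely only on HTTPResponse calling makefile() on the socket it is given:

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a socket that replays fixed bytes."""

    def __init__(self, raw_bytes):
        self._file = io.BytesIO(raw_bytes)

    def makefile(self, *args, **kwargs):
        return self._file


def parse(raw_bytes):
    resp = http.client.HTTPResponse(FakeSocket(raw_bytes))
    resp.begin()  # parses the status line and the headers
    return resp.status, resp.read()


# A well-formed response parses normally.
ok = parse(b'HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi')
print(ok)  # (200, b'hi')

# A body with no status line or headers triggers BadStatusLine.
bad_line = None
try:
    parse(b'<html>\r\n<body>no headers</body></html>')
except http.client.BadStatusLine as exc:
    bad_line = exc.line
    print('BadStatusLine:', repr(bad_line))
```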

The server is returning a document without HTTP headers. Or this is unexpected behaviour in the code.
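On the code side, one thing worth checking: the first parameter of HTTPConnection.request is the HTTP method, yet the question passes the host name there (conn.request(url, path, headers=headers)). Whether that is what makes this particular server answer with a bare <html> document is an assumption, not something the output proves. The sketch below shows the request line each call would put on the wire; RecordingSocket and request_line are hypothetical helpers, and example.invalid is a placeholder host that is never contacted:

```python
import http.client


class RecordingSocket:
    """Hypothetical stand-in for a socket: records bytes instead of sending."""

    def __init__(self):
        self.sent = b''

    def sendall(self, data):
        self.sent += data

    def close(self):
        pass


def request_line(*request_args):
    conn = http.client.HTTPConnection('example.invalid')
    sock = RecordingSocket()
    conn.sock = sock  # a preset socket means no real connection is opened
    conn.request(*request_args)
    conn.close()
    return sock.sent.split(b'\r\n', 1)[0].decode('ascii')


as_in_question = request_line('example.invalid', '/')
with_method = request_line('GET', '/')
print(as_in_question)  # example.invalid / HTTP/1.1
print(with_method)     # GET / HTTP/1.1
```

Servers differ in how they treat an unknown method, which would fit the pattern of some hosts answering normally and others not.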

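I repeat what I have already said: always read everything from the HTTPResponse object.

A packet capture will confirm what that server is putting on the wire.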