Question

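I'm currently using the script below to load a list of URLs and then check the source of each one for a list of error strings. If none of the error strings is found in the source, the URL is considered valid and written to a text file.

How can I modify this script to check the HTTP status instead? If a URL returns 404 it should be ignored, and if it returns 200 it should be written to the text file. Any help would be much appreciated.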

import urllib2
import sys

error_strings = ['invalid product number', 'specification not available. please contact   customer services.']

def check_link(url):
    if not url:
        return False
    f = urllib2.urlopen(url)
    html = f.read()
    result = False
    if html:
        result = True
        html = html.lower()
        for s in error_strings:
            if s in html:
                result = False
                break
    return result


if __name__ == '__main__':
    if len(sys.argv) == 1:
        print 'Usage: %s <file_containing_urls>' % sys.argv[0]
    else:
        output = open('valid_links.txt', 'w+')
        for url in open(sys.argv[1]):
            if check_link(url.strip()):
                output.write('%s\n' % url.strip())
        output.flush()
        output.close()

Answer 1

You can change your call to urlopen slightly:

>>> try:
...     f = urllib2.urlopen(url)
... except urllib2.HTTPError, e:
...     print e.code
...
404

With e.code you can check whether the status is 404. If you don't hit the except block, you can use f just as you do now.
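To make the idea above concrete, here is a minimal sketch of a helper that turns both outcomes into a single status code. It is written for Python 3, where urllib2 was split into urllib.request and urllib.error; the name fetch_status and the opener parameter are assumptions added for illustration (the latter so the function can be exercised without a network), not part of the original script.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived.

    fetch_status and the opener parameter are hypothetical names used
    only for this sketch.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as e:
        # Error statuses such as 404 are raised as HTTPError;
        # the numeric status is available as e.code.
        return e.code
    except urllib.error.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None
    # Successful responses expose the status through getcode().
    return response.getcode()
```

HTTPError is a subclass of URLError, so it has to be caught first. A caller could then keep a URL only when fetch_status(url) == 200.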

Answer 2

urllib2.urlopen returns a file-like object with a few additional methods. One of them, getcode(), is what you are looking for. Just add these lines:

if f.getcode() != 200:
    return False

in the relevant place.
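As a rough illustration of where that check could sit, the sketch below places it right after the URL is opened and before the page source is scanned. It is written for Python 3 (urllib.request in place of urllib2). The opener parameter, the shortened ERROR_STRINGS list, and the UTF-8 decoding are assumptions made for this example. Because the default opener raises HTTPError for a 404 instead of returning a response, the sketch also catches that exception.

```python
import urllib.error
import urllib.request

# Shortened, illustrative version of the question's error_strings list.
ERROR_STRINGS = ['invalid product number']


def check_link(url, opener=urllib.request.urlopen):
    """Return True only for a 200 response whose body has no error string."""
    if not url:
        return False
    try:
        f = opener(url)
    except urllib.error.HTTPError:
        # 404 and other error statuses arrive here, not via getcode().
        return False
    # The status check suggested above: reject anything that is not 200.
    if f.getcode() != 200:
        return False
    # Python 3 returns bytes; decoding as UTF-8 is an assumption.
    html = f.read().decode('utf-8', errors='replace').lower()
    if not html:
        return False
    return not any(s in html for s in ERROR_STRINGS)
```

The rest of the script (reading the URL file and writing valid_links.txt) would stay as it is.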

Answer 3

Try this:

def check_link(url):
    if not url:
        return False
    code = None
    try:
        f = urllib2.urlopen(url)
        code = f.getcode()  # the method name is all lowercase
    except urllib2.HTTPError, e:
        code = e.code  # error statuses such as 404 are raised as HTTPError
    result = True
    if code != 200:
        result = False
    return result

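Alternatively, if you would rather maintain a list of invalid codes (invalid_code_strings, which you define yourself) and check against it, it would look like this: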

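def check_link(url):
    if not url:
        return False
    code = None
    try:
        f = urllib2.urlopen(url)
        code = f.getcode()  # the method name is all lowercase
    except urllib2.HTTPError, e:
        code = e.code

    result = True
    # invalid_code_strings must be defined elsewhere; code is an int,
    # so the list should hold ints such as 404
    if code in invalid_code_strings:
        result = False

    return result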
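The list of rejected codes is left undefined above, so here is one possible sketch of how it could be set up and used to filter a file of URLs. Everything named here (INVALID_CODES, filter_links, the get_status callable, the example.invalid URLs) is hypothetical and only illustrates the idea. The codes are stored as integers because both getcode() and HTTPError.code yield integers. The sketch uses Python 3 syntax.

```python
import io

# Hypothetical set of statuses to reject; integers, not strings.
INVALID_CODES = {404, 410, 500}


def filter_links(lines, get_status, out, invalid_codes=INVALID_CODES):
    """Write every URL with a known, non-rejected status to out.

    get_status is any callable mapping a URL to an int status code,
    or to None when no response was received. Returns the number of
    URLs written.
    """
    written = 0
    for line in lines:
        url = line.strip()
        if not url:
            continue
        code = get_status(url)
        if code is None or code in invalid_codes:
            continue
        out.write('%s\n' % url)
        written += 1
    return written


# Usage with canned statuses instead of real network requests.
statuses = {'http://a.example.invalid/': 200, 'http://b.example.invalid/': 404}
buf = io.StringIO()
filter_links(['http://a.example.invalid/\n', 'http://b.example.invalid/\n'],
             statuses.get, buf)
print(buf.getvalue())
```

In the real script, get_status would be a function that calls urlopen and returns getcode() or e.code, and out would be the open valid_links.txt file.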
