python tldextract.extract giving BadStatusLine: ''

Asked: 2013-10-09 10:12:15

Tags: python tld

Calling tldextract.extract(url) fails with the error BadStatusLine: '':

subdomain, domain, tld = tldextract.extract(url)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 194, in extract
    return TLD_EXTRACTOR(url)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 128, in __call__
    return self._extract(netloc)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 132, in _extract
    registered_domain, tld = self._get_tld_extractor().extract(netloc)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in _get_tld_extractor
    tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in <genexpr>
    tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 204, in _PublicSuffixListSource
    page = _fetch_page('http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1')
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 198, in _fetch_page
    return unicode(urllib2.urlopen(url).read(), 'utf-8')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''

3 answers:

Answer 0 (score: 4)

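This happened because the mozilla.org URL in your stack trace (http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1) was unavailable, and tldextract tries to update from that URL on first install. This live update can be disabled (see below), but the uncaught exception is a tldextract bug. It should just log the exception and seamlessly fall back to the package's bundled PSL (Public Suffix List).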

This is fixed in tldextract 1.2.1, just released to PyPI. It switches to a GitHub mirror of the PSL, so upgrading should resolve the uncaught exception.

Another release will follow soon to avoid future uncaught exceptions, for example if the GitHub PSL mirror is unavailable.
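The "log the exception and fall back to bundled data" behaviour described above can be sketched roughly as follows. This is an illustrative Python 3 sketch, not tldextract's actual code: `load_suffixes`, `fetch_remote`, `broken_fetch`, and `BUNDLED_SUFFIXES` are hypothetical names, and the bundled set is a tiny stand-in for a real PSL snapshot.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for a suffix snapshot shipped inside a package.
BUNDLED_SUFFIXES = frozenset(["com", "org", "co.uk"])


def load_suffixes(fetch_remote, bundled=BUNDLED_SUFFIXES):
    """Return the suffixes from fetch_remote(), or the bundled set on failure."""
    try:
        return frozenset(fetch_remote())
    except Exception:
        # Any network or HTTP failure (such as BadStatusLine) is logged
        # instead of being propagated to the caller.
        logger.exception("Live suffix fetch failed; using bundled snapshot")
        return bundled


def broken_fetch():
    # Simulates the unavailable remote URL.
    raise IOError("remote list unavailable")


suffixes = load_suffixes(broken_fetch)
```

With this shape, a failed download costs only freshness: the caller still gets a usable (if older) suffix set instead of a traceback.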

Turning off the default fetch

You can avoid this problem in earlier versions by turning off the default on-first-install fetch. Use TLDExtract to construct your own callable with fetch=False. From the docs:

import tldextract
no_fetch_extract = tldextract.TLDExtract(fetch=False)
no_fetch_extract('http://www.google.com')

Answer 1 (score: 2)

The package is trying to download the Public Suffix List from a URL that is currently not working:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

This is due to a DDoS attack on that URL; Mozilla has temporarily blocked it.

This has already been reported to the project and a fix has been proposed, although the latter only works if you already have a cached copy of the Public Suffix List.

In the meantime, use the publicsuffix package instead; it bundles the data in the package itself and does not need a URL request.

Update: Mozilla now hosts the file at https://publicsuffix.org/list/effective_tld_names.dat, and any request to the MXR source repository that lacks a mxr.mozilla.org Referer header is redirected to that new location.
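For anyone who wants to read the list from the new location directly, here is a minimal Python 3 sketch. `parse_suffix_list` and `fetch_suffix_list` are hypothetical helper names, not part of tldextract or publicsuffix, and the 10-second timeout is an arbitrary choice. The parser relies only on the list's basic file format: one rule per line, with `//` comment lines and blank lines ignored.

```python
from urllib.request import urlopen

# URL taken from the update above.
PSL_URL = "https://publicsuffix.org/list/effective_tld_names.dat"


def parse_suffix_list(text):
    """Return the set of rules found in Public Suffix List text."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and // comments.
        if not line or line.startswith("//"):
            continue
        # A rule ends at the first whitespace on the line.
        rules.add(line.split()[0])
    return rules


def fetch_suffix_list(url=PSL_URL, timeout=10):
    """Download and parse the list; network errors propagate to the caller."""
    with urlopen(url, timeout=timeout) as response:
        return parse_suffix_list(response.read().decode("utf-8"))


# Offline demonstration on a small hand-written sample.
sample = "// comment\ncom\n\n*.ck\n!www.ck\nco.uk\n"
rules = parse_suffix_list(sample)
```

Keeping the parser separate from the download means a saved copy of the file can be parsed the same way when the network is unavailable.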

Answer 2 (score: 0)

This is because http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 is not being served.

If you want to keep using tldextract to get the subdomain, domain, and tld, a temporary workaround is to use a cache file, for example in project/tldextractor/__init__.py:

import os
import tldextract

# Keep the cache file next to this module and disable the live fetch.
TLD_CACHE_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), 'tldextract_cache')
tldextractor = tldextract.TLDExtract(cache_file=TLD_CACHE_PATH, fetch=False)

Contents of project/tldextractor/tldextract_cache: https://gist.github.com/AJamesPhillips/6899560

Then:

from .tldextractor import tldextractor
tldextractor('http://subdomain.domain.tld')