BadStatusLine: '' when using tldextract.extract(url)
Error:
subdomain, domain, tld = tldextract.extract(url)
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 194, in extract
return TLD_EXTRACTOR(url)
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 128, in __call__
return self._extract(netloc)
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 132, in _extract
registered_domain, tld = self._get_tld_extractor().extract(netloc)
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in _get_tld_extractor
tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in <genexpr>
tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 204, in _PublicSuffixListSource
page = _fetch_page('http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1')
File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 198, in _fetch_page
return unicode(urllib2.urlopen(url).read(), 'utf-8')
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
BadStatusLine: ''
Answer 0 (score: 4)
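This happens because the mozilla.org URL in your stack trace (http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1) is unavailable, and tldextract tries to update from that URL on first install. This live update can be disabled (see below), but the uncaught exception is a tldextract bug. It should just log the exception and seamlessly fall back to the package's bundled PSL snapshot.
This is fixed in tldextract 1.2.1, just released to PyPI. It switches to a GitHub mirror of the PSL, so upgrading should resolve the uncaught exception.
Another release will follow soon to avoid uncaught exceptions in the future, should the GitHub PSL mirror become unavailable.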
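The "log and fall back" behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not tldextract's actual implementation: `fetch_page`, `load_suffix_list`, and `BUNDLED_SNAPSHOT` are hypothetical names, and the three-line snapshot is a stand-in for the real bundled list.

```python
import logging

try:  # Python 3
    from http.client import HTTPException
    from urllib.request import urlopen
except ImportError:  # Python 2, as in the traceback above
    from httplib import HTTPException
    from urllib2 import urlopen

LOG = logging.getLogger(__name__)

# Hypothetical stand-in for the snapshot a package would bundle.
BUNDLED_SNAPSHOT = u"com\norg\nco.uk\n"


def fetch_page(url, timeout=5):
    """Download url and decode the body as UTF-8."""
    return urlopen(url, timeout=timeout).read().decode("utf-8")


def load_suffix_list(url, fetch=fetch_page, fallback=BUNDLED_SNAPSHOT):
    """Return the live list, or the fallback if the request fails."""
    try:
        return fetch(url)
    except (HTTPException, IOError, ValueError) as exc:
        # BadStatusLine is a subclass of HTTPException; URLError and
        # socket errors derive from IOError; decode errors are ValueErrors.
        LOG.warning("Could not fetch %s (%r); using bundled snapshot", url, exc)
        return fallback
```

With this shape, a dead server produces a log line instead of a crash, and the caller still gets a usable (if possibly stale) suffix list.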
In earlier versions you can avoid this issue by turning off the default on-first-install fetch. Build your own callable using TLDExtract with fetch=False. From the docs:
import tldextract
no_fetch_extract = tldextract.TLDExtract(fetch=False)
no_fetch_extract('http://www.google.com')
Answer 1 (score: 2)
The package is trying to download the Public Suffix List from a URL that is currently not working:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
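This is due to a DDoS attack on that URL; Mozilla has temporarily blocked it.
This has already been reported to the project and a fix has been proposed, although the latter only works if you already have a cached copy of the Public Suffix List.
In the meantime, use the publicsuffix package instead; it bundles the data in the package itself and needs no URL request.
Update: Mozilla now hosts the file at https://publicsuffix.org/list/effective_tld_names.dat, and any access to the MXR source repository without an mxr.mozilla.org Referer header is redirected to that new location.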
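Answer 2 (score: 0)
This is because http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 is not being served.
If you want to keep using tldextract to get the subdomain, domain, and tld, a temporary workaround is to use a cache, e.g. in project/tldextractor/__init__.py: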
import os
import tldextract
TLD_CACHE_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), 'tldextract_cache')
tldextractor = tldextract.TLDExtract(cache_file=TLD_CACHE_PATH, fetch=False)
project/tldextractor/tldextract_cache: https://gist.github.com/AJamesPhillips/6899560
Then:
from .tldextractor import tldextractor
tldextractor('http://subdomain.domain.tld')