Question

Is there a standard function for checking an IRI? To check a URL, apparently I can use:

import urlparse

parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    pass  # apparently not a URL

I tried the above with a URL containing Unicode characters:

# -*- coding: utf-8 -*-
import urlparse
url = "http://fdasdf.fdsfîășîs.fss/ăîăî"
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    print "not an url"
else:
    print "yes an url"

What I got was yes an url. Does this mean my test for a valid IRI is good enough? Is there another way?

Answer 1

Using urlparse is not sufficient to test for a valid IRI.

Use the rfc3987 package instead:

from rfc3987 import parse

parse('http://fdasdf.fdsfîășîs.fss/ăîăî', rule='IRI')
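As far as I know, parse raises ValueError when the string does not match the requested rule, so a yes/no check can be wrapped around it. This is only a minimal sketch in Python 3 syntax: is_valid_iri is a name made up for illustration, and the guarded import (returning None when the third-party package is missing) is my own assumption, not something the package prescribes.

```python
try:
    from rfc3987 import parse  # third-party package: pip install rfc3987
except ImportError:
    parse = None  # assumption of this sketch: degrade instead of crashing


def is_valid_iri(text):
    """Hypothetical helper: True/False, or None if rfc3987 is unavailable."""
    if parse is None:
        return None
    try:
        # rule='IRI' requires an absolute IRI, i.e. one that has a scheme.
        parse(text, rule='IRI')
    except ValueError:
        return False
    return True


print(is_valid_iri("http://fdasdf.fdsfîășîs.fss/ăîăî"))
print(is_valid_iri("http://exa mple.com/"))
```

Unlike the urlsplit test in the question, this should reject strings such as the second one, because a literal space is not allowed anywhere in an IRI.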

Answer 2

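The only character-set-sensitive code in the implementation of urlparse is the requirement that the scheme contain only ASCII letters, digits and the [+-.] characters; otherwise it is completely agnostic, so non-ASCII characters work fine.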

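Since this is non-documented behaviour, it is your responsibility to check that it continues to be the case (with tests in your project), but I don't imagine it would be changed in a way that breaks IRIs.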
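A test of that kind could look roughly like the sketch below. It is written for Python 3, where the urlparse module lives in urllib.parse; looks_like_url and the test function name are made up for illustration, and the example strings other than the question's URL are my own.

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def looks_like_url(url):
    """The question's loose check: a scheme and a network location exist."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


def test_urlsplit_keeps_non_ascii():
    """Pin the undocumented behaviour so a library change gets noticed."""
    parts = urlsplit("http://fdasdf.fdsfîășîs.fss/ăîăî")
    assert parts.scheme == "http"
    assert parts.netloc == "fdasdf.fdsfîășîs.fss"
    assert parts.path == "/ăîăî"
    # Non-ASCII letters before the first ':' are not accepted as a scheme.
    assert urlsplit("hттp://example.com/").scheme == ""
    # The check is loose: a space is not legal in an IRI, yet this passes.
    assert looks_like_url("http://exa mple.com/")


test_urlsplit_keeps_non_ascii()
```

The last assertion is the reason this check only says "has a scheme and a host part" and nothing about whether the string is a well-formed IRI.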

urllib provides quoting functions to convert IRIs to ASCII URIs, although they still don't mention IRIs explicitly in the documentation, and they are broken in some cases: Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?
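For what it's worth, here is a rough sketch of such a conversion in Python 3, where quote works on UTF-8 by default. iri_to_uri and _SAFE are names I made up; the sketch assumes a plain host name (no user info, no IPv6 literal) and uses the built-in idna codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding. '%' is kept so that
# already-encoded sequences are not encoded twice (an assumption here).
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri):
    """Hypothetical helper: map an IRI to an ASCII-only URI."""
    parts = urlsplit(iri)
    # Host name: IDNA (Punycode) per label via the built-in codec.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    # Everything else: UTF-8 bytes, percent-encoded.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/ä"))
# → http://xn--bcher-kva.example/%C3%A4
```

The host and the rest of the IRI are treated differently on purpose: host names go through IDNA, while path, query and fragment are percent-encoded as UTF-8.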

Python: How do I check whether a string is a valid IRI?
