Question

Suppose I have these URLs.

http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu

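I only want to extract "domainname.com", "domainname.org", or "domainname.edu". How can I do this?

I think I need to find the last dot before "com | org | edu ...", and then print the content from the dot before that dot up to the dot after it (if there is one).

I need help with the regular expression. Thanks a lot! I am using Python.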

Answer 1

Why use a regular expression?

http://docs.python.org/library/urlparse.html
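
As a minimal sketch of what this answer suggests, the following uses `urlparse` from Python 3's `urllib.parse` (the linked page documents the Python 2 `urlparse` module, which provides the same function). The helper name `last_two_labels` is hypothetical, and keeping only the last two labels is an assumption that fits the question's examples but not suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's host (hypothetical helper)."""
    # urlparse only fills in the host when the URL has a scheme or starts with '//'.
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Join the final two labels, e.g. ['mail', 'domainname', 'org'] -> 'domainname.org'.
    return ".".join(labels[-2:])


for url in [
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://mail.domainname.org/abc/abc/aaa",
    "http://domainname.edu",
]:
    print(last_two_labels(url))
```

Because `hostname` excludes any port, a URL such as `http://domainname.com:80` should also reduce to `domainname.com` here.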

Answer 2

If you want to go the regex route...

RFC 3986 is the authority on URIs. Appendix B provides this regular expression for breaking a URI down into its components:

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme    = $2
# authority = $4
# path      = $5
# query     = $7
# fragment  = $9
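
To make the group numbering concrete, here is a small sketch that applies the Appendix B expression with Python's `re` module and reads the numbered groups listed above. The function name `split_uri` and the sample URL's query and fragment are illustrative assumptions, not part of the RFC.

```python
import re

# The RFC 3986 Appendix B expression, unchanged.
RE_3986 = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI into its five generic components (hypothetical helper)."""
    # Every part of the pattern is optional, so it matches any string.
    m = RE_3986.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://mail.domainname.org/abc/abc/aaa?x=1#top")
print(parts["authority"])  # the host part that the domain is later picked from
```

Components that are absent from the input come back as `None`, except the path, which is an empty string when missing.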

Here is an enhanced, Python-friendly version that uses named capture groups. It is shown inside a function in a working script:

import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+):// )?  # capture optional scheme
        (?:(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain =  re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})  # $domain: top two domain levels.
        (?::[0-9]*)?                      # Optional port number.
        $                                 # Anchor to end of string.
        """, 
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
    r"www.domainname.com#fragment",
    r"https://domainname.com#fragment",
    ]
cnt = 0
for data in data_list:
    cnt += 1
    print("Data[%d] domain = \"%s\"" %
        (cnt, get_domain(data)))

For more on picking apart and validating URIs according to RFC 3986, you may want to take a look at an article I have been working on: Regular Expression URI Validation

Answer 3

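In addition to Jase's answer: if you don't want to use urlparse, just split the URL.

Strip the protocol (http:// or https://), then split the string at the first occurrence of '/'. This leaves you with something like 'mail.domainname.org' for the second URL. That can then be split on '.', and you just select the last two items from the list with [-2:].

This will always yield domainname.org or the like, provided you stripped the protocol correctly and the URL is valid.

I would just use urlparse, but it can be done this way. I don't know about the regex, but this is how I would do it.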
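
The splitting steps described in this answer can be sketched roughly as follows. The function name `domain_by_split` is hypothetical, and the sketch assumes the URL has no port, query, or fragment attached directly to the host.

```python
def domain_by_split(url):
    """Return the last two host labels using plain string splitting (hypothetical helper)."""
    # Strip the protocol ("http://" or "https://") if one is present.
    rest = url.split("://", 1)[-1]
    # Keep everything before the first '/', which leaves the host.
    host = rest.split("/", 1)[0]
    # Split the host on '.' and keep the last two pieces.
    return ".".join(host.split(".")[-2:])


print(domain_by_split("http://mail.domainname.org/abc/abc/aaa"))
```

An input like `http://domainname.com:80` would keep the `:80` suffix with this approach, which is one reason to prefer urlparse.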

Answer 4

If you need more flexibility than urlparse provides, here is an example to get you started:

import re
def getDomain(url):
    #requires 'http://' or 'https://'
    #pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    #'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False

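I use the named group (?P<domain>\w+) to grab the match, which is then indexed by its name, m.group('domain'). The great thing about learning regular expressions is that once you are comfortable with them, solving even the most complicated parsing problems is relatively simple. This pattern could be made more or less forgiving if necessary; for example, if you pass in 'http://123.45.678.90' it will return '678', but it should work just fine on pretty much any other URL you can come up with. Regexr is a great resource for learning and testing regular expressions.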
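
As one possible way to tighten the pattern, the sketch below is a variation rather than this answer's own code. It adds a second named group, `tld`, which must consist of letters, so dotted IP addresses no longer match, and it returns the domain together with its TLD as the question asks. The names `DOMAIN_PAT`, `tld`, and `domain_and_tld` are assumptions, and like the original it relies on `\w`, so host labels containing hyphens are not handled.

```python
import re

# Optional scheme, any number of subdomains, then the domain and a letters-only TLD.
DOMAIN_PAT = re.compile(
    r'(?:https?://)?(?:\w+\.)*(?P<domain>\w+)\.(?P<tld>[A-Za-z]+)(?:[/:?#].*)?$'
)


def domain_and_tld(url):
    """Return 'domain.tld', or None if the host does not end in a letters-only TLD."""
    m = DOMAIN_PAT.match(url)
    if m is None:
        return None
    return m.group('domain') + '.' + m.group('tld')


print(domain_and_tld('http://abdd.eesfea.domainname.com/b/33tA$/0021/file'))
print(domain_and_tld('http://123.45.678.90'))
```

Returning `None` instead of `False` for a non-match is a stylistic choice in this sketch; either works with a plain truthiness check.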

Extracting a domain using a regular expression
