Question

I have these URLs:

http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM
http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM
http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM
http://www.domain.com[affiliate=adwords&ved=0CPsCENEM

How can I get the domain name from these URLs, even when an arbitrary character follows the TLD?

Currently I'm using the following regular expression, but it only works when a / follows the TLD:

https?:\/\/(?!.*https?:\/\/)(?:www\.)([\da-z\.-]+)\.([a-z\.]{2,9})

Answer 1

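You can use Python's urlparse (urllib.parse in Python 3).

from urllib.parse import urlsplit  # Python 2: import urlparse

# netloc is everything between '://' and the first '/', '?' or '#'
s = urlsplit('http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM').netloc
ind = 0
parts = s.split('.')
# Skip the 'www' label if there is one
if 'www' in parts:
    ind = parts.index('www') + 1
print(parts[ind])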
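
As far as I know, urlsplit in Python 3 raises ValueError ("Invalid IPv6 URL") for the two sample URLs that contain a lone [ or ]. A minimal sketch that applies the same split-and-skip-www idea to all four URLs could cut the host out with a regex first. The helper name domain_label and the host character set [0-9a-z.-] are assumptions for illustration, not part of the answer:

```python
import re

# Assumed host characters: letters, digits, dots and hyphens.
HOST_RE = re.compile(r'^[^:]+://([0-9a-z.-]+)', re.IGNORECASE)

def domain_label(url):
    """Return the first host label after an optional leading 'www'."""
    match = HOST_RE.match(url)
    if match is None:
        return None
    parts = match.group(1).split('.')
    # Skip a leading 'www' label, but only if something follows it
    if parts[0] == 'www' and len(parts) > 1:
        return parts[1]
    return parts[0]

urls = [
    'http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.com[affiliate=adwords&ved=0CPsCENEM',
]
for url in urls:
    print(domain_label(url))  # prints "domain" for each sample URL
```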

Answer 2

In the comments you said that you are using Ruby. With the URLs stored in urls.txt, you can do something like this:

File.open("urls.txt", "r") do |file_handle|
    file_handle.each_line do |url|
        # Capture the host: dot-separated labels made of [a-z0-9-]
        url =~ /^[^:]+:\/\/((\.?[a-z0-9-]+)+)/
        domain = $1
        puts domain
    end
end

Explanation

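The regex relies on the fact that any delimiter you can think of has to obey at least one rule: it must be a character that is not allowed in a domain or host name. The characters allowed in a domain or host name are [0-9a-z-]. (Note that Unicode characters are allowed as well; I am ignoring that in this answer for now.)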

^              Matches the start of the line (here, of each URL)
[^:]           Character class. Matches any character except `:`
+              The previous match must occur one or more times
:\/\/          The :// after the URL protocol
(              Start of the outer capture group for the whole domain ($1)
(              Start of the inner capture group. Matches a subdomain
\.?            An optional literal dot
[a-z0-9-]+     Subdomain, host name or TLD. At least one character
)              End of the inner capture group
+              Any number of subdomains, but at least one host name
)              End of the outer capture group

The domain is available through the first capture group, $1.
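
For readers not on Ruby, here is a rough Python port of the pattern explained above, offered as a sketch rather than a drop-in replacement. The function name extract_host is made up for this example, and the inner group is written as non-capturing so that group 1 plays the role of $1:

```python
import re

# Same structure as the Ruby pattern: scheme, '://', then one or more
# labels of [a-z0-9-], each optionally preceded by a dot.
PATTERN = re.compile(r'^[^:]+://((?:\.?[a-z0-9-]+)+)')

def extract_host(url):
    """Return the host part (group 1) or None if the URL does not match."""
    match = PATTERN.match(url)
    return match.group(1) if match else None

print(extract_host('http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM'))
# www.domain.co.uk
print(extract_host('http://www.domain.com[affiliate=adwords&ved=0CPsCENEM'))
# www.domain.com
```

Matching stops at the first character outside [a-z0-9.-], which is why &, :, ] and [ all work as delimiters.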

First answer

It depends on the regex engine.

The following regex works with Perl-compatible regular expressions (PCRE):

grep -ioP '^[^:]+://\K(\.?[a-z0-9-]+)+'

With POSIX extended regular expressions and awk, you can use:

awk -F'(://|[^0-9a-zA-Z.-])' '{print $2}'
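
The awk one-liner splits each line on either :// or any character that cannot be part of a host name, then prints the second field. The same idea can be sketched in Python with re.split. The name second_field is hypothetical, and treating the hyphen as a host character is an assumption of this sketch:

```python
import re

# Field separator: '://' or any single character outside the host set.
# '://' is listed first so the scheme separator is consumed as a whole.
SEPARATOR = re.compile(r'://|[^0-9a-zA-Z.-]')

def second_field(url):
    """Mimic awk's $2: the field right after the scheme separator."""
    fields = SEPARATOR.split(url)
    return fields[1] if len(fields) > 1 else None

print(second_field('http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM'))
# www.domain.co.uk
```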

...

Answer 3

This should work:

://.*?(\w+)([^\w.]|$)

Use group 1 of the match.

See the demo.
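
To see what group 1 holds for the sample URLs from the question, the pattern can be tried in Python. The answer does not name a regex engine, so Python's re module is an assumption here, and group_one is a made-up helper name:

```python
import re

PATTERN = re.compile(r'://.*?(\w+)([^\w.]|$)')

def group_one(url):
    """Return group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(url)
    return match.group(1) if match else None

print(group_one('http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM'))  # uk
print(group_one('http://www.domain.com[affiliate=adwords&ved=0CPsCENEM'))    # com
```

With this engine, group 1 is the last label before the delimiter, so check whether that is the part you are after.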
