Question

I want to match just the domain name in each of these URLs:

http://www.google.com/test/
http://google.com/test/
http://google.net/test/

The output should be the same for all three: google

My code only works for .com:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'

Then I thought it would be as simple as writing (com|net), but that does not seem to work:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)

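I was going to use a similar approach to get rid of the "www", but it seems I am doing something wrong... (Is the alternation not valid in a regex outside of \( \)?)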

Answer 1

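If you have Python, you can use the urllib.parse module (named urlparse in Python 2):

from urllib.parse import urlparse

with open("file") as f:
    for line in f:
        # Strip the trailing newline, then split the host name into its labels.
        o = urlparse(line.strip())
        d = o.netloc.split(".")
        # Skip a leading "www" label if there is one.
        if d[0] == "www" and len(d) > 1:
            print(d[1])
        else:
            print(d[0])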

Output:

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/

$ ./python.py
google
google
google

Or you can use awk:

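awk '{
    # Remove the leading "http://" and everything from the next "/" onward.
    gsub(/http:\/\/|\/.*$/, "")
    # Split the remaining host name into its dot-separated labels.
    split($0, d, ".")
    if (d[1] == "www") {
        print d[2]
    } else {
        print d[1]
    }
}' file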

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test

$ ./shell.sh
google
google
google
google
google
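
The awk version also copes with inputs that have no "http://" prefix, which the urlparse approach does not do on its own, because the host only lands in netloc when a scheme or a leading "//" is present. A minimal sketch of one way to get the same behavior in Python follows; the helper name first_label is hypothetical, and the "prepend //" step is an assumption about how scheme-less input should be treated.

```python
from urllib.parse import urlparse


def first_label(url):
    """Return the first host label that is not "www" (hypothetical helper)."""
    # urlparse only fills in the host when the URL has a scheme or starts
    # with "//", so bare inputs such as "google.com/test" get a "//" prefix.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Skip a leading "www" label if there is one.
    if labels[0] == "www" and len(labels) > 1:
        return labels[1]
    return labels[0]


for url in ["http://www.google.com/test/", "www.google.com.cn/test", "google.com/test"]:
    print(first_label(url))
```

Like the awk script, this only returns the first non-"www" label; it does not know which suffixes are real top-level domains.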

Answer 2

This outputs "google" in all cases:

sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"

Edit:

This version handles URLs such as "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones from the original question:

sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"

This version also handles cases that do not contain "http://" (among others):

sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
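
For comparison, the same idea (optional scheme, optional "www.", then everything up to the next dot) can be sketched with Python's re module. This is a simplified adaptation rather than a literal port of the sed expression; the names DOMAIN_RE and domain_label are my own, and accepting "https://" is an added assumption.

```python
import re

# Optional "http://" or "https://", optional "www.", then capture the
# characters up to the next dot (slashes excluded so a path is never matched).
DOMAIN_RE = re.compile(r"^(?:https?://)?(?:www\.)?([^./]+)\.")


def domain_label(url):
    """Return the captured label, or None when the input contains no dot."""
    m = DOMAIN_RE.match(url)
    return m.group(1) if m else None


for url in ["http://google.com.cn/test", "http://www.google.co.uk/", "google.net/test"]:
    print(domain_label(url))
```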

Answer 3

s|http://(www\.)?([^.]*).*|$2|

This is Perl with an alternate delimiter (because it makes the expression clearer); I'm sure you can port it to sed or whatever you need.

Answer 4

Have you tried using the "-r" switch on your sed command? It enables extended regular expression mode (egrep-compatible regexes).

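Edit: try this, it seems to work. In extended mode the alternation goes in a plain (com|net) group. That group is captured as \2, which the replacement simply ignores; sed has no Perl-style "(?:...)" non-capturing groups.

echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"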
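
In engines that do support non-capturing groups, the effect of "?:" is easy to see. A minimal sketch using Python's re module (not sed), with the URL from the question; the variable names are only for illustration.

```python
import re

url = "http://www.google.com/test/"

# A plain group captures the alternative that matched, as a second group.
capturing = re.search(r"www\.(.*)\.(com|net)", url)

# With "?:" the group still does the alternation but is not captured.
non_capturing = re.search(r"www\.(.*)\.(?:com|net)", url)

print(capturing.groups())      # ('google', 'com')
print(non_capturing.groups())  # ('google',)
```

Either way the domain label is group 1, so for this task the extra captured group is harmless.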

Answer 5

#!/bin/bash

urls=(
  http://www.google.com/test/
  http://google.com/test/
  http://google.net/test/
)

for url in "${urls[@]}"; do
  # Drop the scheme and any subdomains, then keep the label before the TLD.
  printf '%s\n' "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done

Match the domain name from a URL (www.google.com = google)
