Question

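I am trying to match one part of a URL. The URL has already been processed and contains only the domain name.

For example:

The URL I have is business.time.com, and I want to get rid of the top-level domain (.com). The result I want is business.time

I am using the following code: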

gawk '{
match($1, /[a-zA-Z0-9\-\.]+[^(.com|.org|.edu|.gov|.mil)]/, where)
print where[0]
print where[1]
}' test

The file test contains four lines:

business.time.com
mybest.try.com
this.is.a.example.org
this.is.another.example.edu

I was expecting this:

business.time

mybest.try

this.is.a.example

this.is.another.example

However, the output is:

business.t

mybest.try

this.is.a.examp

this.is.another.examp

Can anyone tell me what is wrong and what I should do?

Thanks

Answer 1

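Why not use the dot as the field separator and do:

awk -F. 'sub(FS $NF,x)' test

Here FS $NF builds the pattern from the separator and the last field (for example .com), and x is an unset variable, so sub() replaces the first match with the empty string.

Or use something more readable, such as:

rev test | cut -d. -f 2- | rev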
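One caveat with building the pattern from FS $NF: the dot is a regex metacharacter and the pattern is not anchored, so a name such as welcome.com would lose lcom from the middle rather than .com from the end. A minimal sketch of an anchored variant, assuming any POSIX awk; the function name drop_last_label is hypothetical:

```shell
# Hypothetical helper: remove the final dot and everything after it.
drop_last_label() {
    # \.[^.]*$ matches a literal dot plus the last label, anchored at end of line
    awk '{ sub(/\.[^.]*$/, ""); print }'
}

# Reads names on stdin, one per line
printf '%s\n' business.time.com welcome.com | drop_last_label
```

Like the rev/cut pipeline, this removes whatever the last label is, without checking it against a list of known suffixes.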

Answer 2

You can do this:

rev domains.txt | cut -d '.' -f 2- | rev

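But if you want to remove more complex endings, you can use sed with an explicit list, anchored with $ so that only a trailing suffix is removed:

sed -r 's/\.(com(\.hk)?|org|edu|net|gov|mil)$//' domains.txt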
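To see the effect of the explicit list, here is a small sketch that wraps the sed command in a shell function. The name strip_known_tld is hypothetical, -E is used as the portable spelling of GNU sed's -r, and the $ anchor is an assumption about the intent (remove a suffix only at the end of the name):

```shell
# Hypothetical helper: strip one suffix from an explicit list, only at end of line.
strip_known_tld() {
    sed -E 's/\.(com(\.hk)?|org|edu|net|gov|mil)$//'
}

# The third name contains ".com" in the middle; the anchor leaves that part alone.
printf '%s\n' business.time.com shop.example.com.hk this.community.org | strip_known_tld
```

Names whose ending is not in the list pass through unchanged, which is the main difference from the rev/cut approach.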

Answer 3

The problem is that [^] only excludes single characters, not expressions, so what you basically have is this regular expression:

match($1, /[a-zA-Z0-9\-\.]+[^()|.cedgilmoruv)]/, where)

That is why ime.com is left out of the match on business.time.com: all of those characters are in the [^] expression.
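The point about [^...] can be checked directly. This sketch uses the two-argument match() of POSIX awk with RSTART and RLENGTH instead of gawk's array form, and the function name longest_match is hypothetical. It prints what the question's pattern actually matches:

```shell
# Hypothetical helper: print the text matched by the pattern from the question.
longest_match() {
    # The bracket expression is a set of single characters, so the match
    # has to end on a character that is outside that set.
    awk '{
        if (match($1, /[a-zA-Z0-9.-]+[^(.com|.org|.edu|.gov|.mil)]/))
            print substr($1, RSTART, RLENGTH)
    }'
}

printf '%s\n' business.time.com mybest.try.com | longest_match
```

For business.time.com the last character outside the set is the t of time, so the match stops there; for mybest.try.com it is the y of try, which is why that line happened to look correct.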

I couldn't find a good way to do a negative match in gawk, but I did build the following, which I hope works for you:

gawk '{
match($1, /([a-zA-Z0-9\-\.]+)(\.com|\.org|\.edu|\.gov|\.mil)/, where)
print where[0]
print where[1]
print where[2]
}' test

So the first part ends up in where[1], and where[2] holds the top-level domain. For business.time.com these are business.time and .com.
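The three-argument match() with capture groups is a gawk extension. Where only a POSIX awk is available, a similar split can be sketched with RSTART. The function name split_tld and the $ anchor are assumptions added for illustration, not part of the answer above:

```shell
# Hypothetical helper: print the name without its suffix, then the suffix.
split_tld() {
    awk '{
        # Look for one of the listed suffixes, anchored at the end of the name
        if (match($1, /\.(com|org|edu|gov|mil)$/)) {
            print substr($1, 1, RSTART - 1), substr($1, RSTART)
        } else {
            print $1
        }
    }'
}

printf '%s\n' business.time.com this.is.another.example.edu | split_tld
```

Each output line holds the two parts separated by a space; names with no listed suffix are printed unchanged.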