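AWK: remove everything after the domain and replace a column

Time: 2019-12-09 17:26:16

Tags: shell awk

I have the following sample file. I want to strip everything after the domain name in column 4 and put the resulting domain into column 2.

Sample file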

one two three www.four.com
abc def ghi www.jkl.com
lion zebra eagle www.fish.com/sardines/shop
house building room https://www.kitchen.co.uk/something/or/other
plane car motorbike http://www.sheep.org/my/farm/yard/

The final result should be:

one www.four.com three www.four.com
abc www.jkl.com ghi www.jkl.com
lion www.fish.com eagle www.fish.com/sardines/shop
house www.kitchen.co.uk room https://www.kitchen.co.uk/something/or/other
plane www.sheep.org motorbike http://www.sheep.org/my/farm/yard/

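Alternatively, column 2 could contain just domain.com or domain.co.uk; whether http, https, and www are present does not matter. Column 4 does not need to be kept.

I feel like I'm close...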

awk -F'[ ]' '{gsub(/\/.*/,"",$4); $2=$4; print}' sample

...but it produces:

one www.four.com three www.four.com
abc www.jkl.com ghi www.jkl.com
lion www.fish.com eagle www.fish.com
house https: room https:
plane http: motorbike http:

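Any help is appreciated.

1 answer:

Answer 0 (score: 1)

When you split the URL on slashes, the domain ends up in either the first or the third piece; you can tell which by checking whether the URL starts with a protocol prefix. So this should work: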

awk '{ split($4,a,/\//); $2=a[a[1]~/^[a-z]+:/?3:1] } 1' file
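
For readers who want to see the pieces, here is a minimal sketch that spells the one-liner out and runs it on the sample data from the question. The attempt in the question fails on the last two lines because `/\/.*/` matches from the first slash, which in `https://...` comes straight after `https:`. The names `input`, `result`, `bare`, `parts`, `idx` and `host` are made up for illustration. The second command is only one possible way to get the bare domain.com form that the question lists as an acceptable alternative; it assumes a lowercase scheme followed by `://` and makes no attempt to handle ports or other URL shapes.

```shell
# Sample data copied from the question.
input='one two three www.four.com
abc def ghi www.jkl.com
lion zebra eagle www.fish.com/sardines/shop
house building room https://www.kitchen.co.uk/something/or/other
plane car motorbike http://www.sheep.org/my/farm/yard/'

# Same logic as the one-liner above, written out step by step.
result=$(printf '%s\n' "$input" | awk '{
    split($4, parts, /\//)                  # "https://host/x" -> "https:", "", "host", "x"
    idx = (parts[1] ~ /^[a-z]+:/) ? 3 : 1   # protocol prefix present -> host is the 3rd piece
    $2 = parts[idx]                         # overwrite column 2, leave column 4 as it was
    print
}')
printf '%s\n' "$result"

# Hypothetical variant: reduce column 2 to the bare domain using sub().
bare=$(printf '%s\n' "$input" | awk '{
    host = $4
    sub(/^[a-z]+:\/\//, "", host)   # drop a leading scheme such as "https://"
    sub(/\/.*/, "", host)           # drop the first "/" and everything after it
    sub(/^www\./, "", host)         # drop a leading "www."
    $2 = host
    print
}')
printf '%s\n' "$bare"
```

The first command prints the desired output listed in the question. The second prints lines such as `house kitchen.co.uk room https://www.kitchen.co.uk/something/or/other`.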