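I have the following sample file. I want to strip everything after the domain name in column 4 and replace column 2 with the result.
Sample file: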
one two three www.four.com
abc def ghi www.jkl.com
lion zebra eagle www.fish.com/sardines/shop
house building room https://www.kitchen.co.uk/something/or/other
plane car motorbike http://www.sheep.org/my/farm/yard/
The final result should be:
one www.four.com three www.four.com
abc www.jkl.com ghi www.jkl.com
lion www.fish.com eagle www.fish.com/sardines/shop
house www.kitchen.co.uk room https://www.kitchen.co.uk/something/or/other
plane www.sheep.org motorbike http://www.sheep.org/my/farm/yard/
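Alternatively, column 2 could contain just domain.com or domain.co.uk; it doesn't matter whether http, https, or www is included. Column 4 does not need to be preserved.
I feel like I'm very close...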
awk -F'[ ]' '{gsub(/\/.*/,"",$4); $2=$4; print}' sample
...but it produces:
one www.four.com three www.four.com
abc www.jkl.com ghi www.jkl.com
lion www.fish.com eagle www.fish.com
house https: room https:
plane http: motorbike http:
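Any help is appreciated.
Answer 0 (score: 1)
When you split the URL on slashes, the domain is either the first piece or the third; you can tell which by checking whether the URL starts with a protocol prefix. So this should work: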
awk '{ split($4,a,/\//); $2=a[a[1]~/^[a-z]+:/?3:1] } 1' file
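For readers who find the one-liner dense, here is the same idea spelled out step by step. This is only a sketch: the function name domain_to_col2 and the variable names parts and idx are made up for illustration and are not part of the answer above. It also shows why the original gsub attempt left "https:" behind: the first slash in such a URL comes right after the scheme, so deleting from the first slash onward removes the domain as well.

```shell
# Hypothetical wrapper (the name is an assumption) around the split-based approach.
domain_to_col2() {
  awk '{
    # "https://host/path" splits into "https:", "", "host", "path";
    # "host/path" splits into "host", "path".
    split($4, parts, "/")
    # If the first piece looks like a scheme ("http:", "https:"), the host is piece 3.
    idx = (parts[1] ~ /^[a-z]+:$/) ? 3 : 1
    $2 = parts[idx]
    print
  }' "$@"
}

# Reads from standard input here; a file name can be passed as an argument instead.
printf '%s\n' \
  'lion zebra eagle www.fish.com/sardines/shop' \
  'house building room https://www.kitchen.co.uk/something/or/other' |
  domain_to_col2
```

If the bare domain is preferred, as the question allows, adding sub(/^www\./, "", $2) just before print should drop a leading "www.".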