I have a file like this:
http://article.wn.com/view/2010/11/26/IV_drug_policy_feels_HIV_patients_Red_Cross/ http://aidsjournal.com/,www.cfpa.org.cn/page1/page2 , www.youtube.com
http://seattletimes.nwsource.com/html/jerrybrewer/2013517803_brewer25.html http://www.moortowntoday.co.uk/your-moortown/Yorkshire-Evening-Post-First-for.6038672.jp, www.yorkshireeveningpost.co.uk/business/1/
I want to extract just the domain part of each URL:
http://article.wn.com http://aidsjournal.com,www.cfpa.org.cn, www.youtube.com
http://seattletimes.nwsource.com http://www.moortowntoday.co.uk, www.yorkshireeveningpost.co.uk
I used this script, but it only gives me the result for the first column:
sed 's|\(http://[^/]*/\).*|\1|g' file
Any suggestion that works for all the URLs in the file would be appreciated.
Answer 0 (score: 1)
With perl:
$ perl -ple 's/(?:http:\/\/|www\.)[^\/]*\K[^, ]*//g' file
http://article.wn.com http://aidsjournal.com,www.cfpa.org.cn , www.youtube.com
http://seattletimes.nwsource.com http://www.moortowntoday.co.uk, www.yorkshireeveningpost.co.uk
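The sed command in the question stops after the first URL because `.*` is greedy: it swallows the rest of the line, so the `g` flag has nothing left to match. For comparison, here is a minimal sed sketch of the same idea as the perl one-liner. It assumes a sed that supports `-E` (GNU and BSD sed both do) and that URLs are separated by spaces or commas; `strip_paths` is just a hypothetical helper name.

```shell
# Hypothetical helper: reads lines on stdin, keeps only the optional
# scheme plus the host of every URL, and leaves separators untouched.
strip_paths() {
  # \1 = optional "http://" or "https://" followed by the host;
  # the optional third group is the path (first "/" up to the next
  # space or comma), which is dropped.
  sed -E 's#((https?://)?[^/ ,:]+)(/[^ ,]*)?#\1#g'
}

# Sample line taken from the question's data
printf '%s\n' 'http://aidsjournal.com/,www.cfpa.org.cn/page1/page2 , www.youtube.com' | strip_paths
```

To run it on the real data, use `strip_paths < file`.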
Answer 1 (score: 1)
You can try awk:
awk -F/ '{print $1"//"$3}' file
Answer 2 (score: 1)
awk -v FS='[ ,]*' -v OFS=', ' '{ for (i = 1; i <= NF; ++i) { match($i, /^(([[:alpha:]]+:[/][/])?[^/]+)/); $i = substr($i, RSTART, RLENGTH) } print }' file
Output:
http://article.wn.com, http://aidsjournal.com, www.cfpa.org.cn, www.youtube.com
http://seattletimes.nwsource.com, http://www.moortowntoday.co.uk, www.yorkshireeveningpost.co.uk
Answer 3 (score: 0)
A variation on fesias' answer:
awk 'BEGIN{RS="((\n| +),* *|,)";FS="/"}/^http:\/\//{print $1"//"$3;next}{print $1}' file
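Edit: I didn't see cfpa.
Answer 4 (score: 0)
If you don't actually care about the whitespace in the output, and you don't really want the comma at the end of one of the URLs (and if you do, how are we supposed to tell the commas you want from the ones you don't?):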
awk -v RS='[[:space:],]+' '{sub(/http:\/\//," "); sub(/\/.*/,""); sub(/ /,"http://")} 1' file
http://article.wn.com
http://aidsjournal.com
www.cfpa.org.cn
www.youtube.com
http://seattletimes.nwsource.com
http://www.moortowntoday.co.uk
www.yorkshireeveningpost.co.uk
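The script above only recognizes `http://`, and a regular expression in `RS` is an extension (gawk and mawk support it, but POSIX leaves a multi-character `RS` unspecified). As a rough sketch of a more portable variant, the splitting can be done with `tr` and any `scheme://` prefix set aside before the path is cut off. `hosts_only` is a hypothetical helper name, and the assumption that URLs are separated by spaces, tabs or commas comes from the sample data.

```shell
# Hypothetical helper: reads stdin, prints one scheme+host per line.
hosts_only() {
  # Turn every run of spaces, commas and tabs into a single newline.
  tr -s ' ,\t' '\n\n\n' |
  awk 'NF {
    scheme = ""
    # Set aside any "scheme://" prefix so its slashes survive the next step.
    if (match($0, /^[A-Za-z][A-Za-z0-9+.-]*:\/\//)) {
      scheme = substr($0, 1, RLENGTH)
      $0 = substr($0, RLENGTH + 1)
    }
    sub(/\/.*/, "")   # drop the first "/" and everything after it
    print scheme $0
  }'
}

# Sample line taken from the question's data
printf '%s\n' 'http://aidsjournal.com/,www.cfpa.org.cn/page1/page2 , www.youtube.com' | hosts_only
```

As with the answer above, the original separators are not preserved. Run it on the real data with `hosts_only < file`.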