Question

I have a file like this:

Timestamp       URL                    Text                    
1331635241000   http://example.com     Peoples footage at www.test.com,http://example4.com
1331635231000   http://example1.net    crack the nuts http://example6.com   
1331635280000   http://example2.net    Loving this

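The columns are tab-separated. I need to keep column 2 and extract only the URLs from column 3, leaving column 3 blank when it contains no URL, so that the result looks like this: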

URL                    Text
http://example.com     www.test.com,http://example4.com 
http://example1.net    http://example6.com
http://example2.net

I tried this script:

awk 'BEGIN {FS="\t"} {print $2,$3}' file | grep -oP '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]'

This script prints all of the URLs in a single column, without the header. Any suggestions for fixing this?

Answer 1

Just do it all in one awk script:

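awk '
BEGIN{ FS=OFS="\t" }            # tab-separated input and output
NR==1 { print $2, $3; next }    # header line: keep the URL and Text column names
{
    urls = ""
    # Repeatedly find the leftmost URL-like substring left in column 3.
    while ( match($3,/((https?|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)/) ) {
        urls = (urls ? urls "," : "") substr($3,RSTART,RLENGTH)    # append, comma-separated
        $3 = substr($3,RSTART+RLENGTH)                             # keep only the text after this match
    }
    print $2, urls              # urls stays empty when column 3 had no URL
}
' file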
URL     Text
http://example.com      www.test.com,http://example4.com
http://example1.net     http://example6.com
http://example2.net
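
To see how the loop works in isolation, here is a minimal sketch that runs the same match()/RSTART/RLENGTH technique over whole lines instead of column 3. The function name extract_urls and the variable rest are made up for this illustration and are not part of the script above; the regexp is copied from it unchanged.

```shell
# Hypothetical helper (not from the script above): print every URL-like
# token found on each input line, joined with commas.
extract_urls() {
    awk '{
        rest = $0; urls = ""
        # match() returns 0 when nothing is left to find; otherwise it sets
        # RSTART (1-based start) and RLENGTH (length) of the leftmost match.
        while ( match(rest, /((https?|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)/) ) {
            urls = (urls ? urls "," : "") substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)   # continue after this match
        }
        print urls
    }'
}

printf 'Peoples footage at www.test.com,http://example4.com\n' | extract_urls
```

Working on a copy (rest) rather than assigning to $3 avoids rebuilding the record on every iteration, but the result should be the same either way. A line with no URL just prints an empty line.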

I'm not convinced your regexp for matching URLs is entirely accurate; you may want to take another look at it.
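
One way to check is to feed the regexp a few awkward strings and look at what it picks up. The sketch below is only a probe: the function name first_match and the sample lines are invented for this illustration, and the regexp is the one used in the awk script above.

```shell
# Hypothetical probe (not part of the original script): show the first
# substring the regexp matches on each input line, or "(no match)".
first_match() {
    awk '{
        if ( match($0, /((https?|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)/) )
            print substr($0, RSTART, RLENGTH)
        else
            print "(no match)"
    }'
}

printf '%s\n' \
    'see http://example.com.' \
    'version http.2 released' \
    'Visit WWW.TEST.COM' | first_match
```

With these three lines, the first match keeps the sentence's trailing period, the second treats "http.2" as a URL because [.:] accepts a dot after the scheme, and the third finds nothing because the pattern is lowercase only. Whether any of that matters depends on your data.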
