I have a large .warc file containing many records, and I want to extract the headers from it in a bash script.
The file looks like this:
WARC/1.0
WARC-Type: response
Content-Length: 2597724
WARC-Date: 2016-05-07T03:36:46Z
WARC-Payload-Digest: sha1:33a3973a118293e4f8831449cc37095d645a57b3
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>
<!DOCTYPE html>
//some html code
WARC/1.0
WARC-Type: response
Content-Length: 2106841
WARC-Date: 2016-05-07T03:36:51Z
WARC-Payload-Digest: sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e
WARC-Target-URI: url
Content-Type: application/http; msgtype=response
WARC-Record-ID: <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>
<!DOCTYPE html>
//some html code
etc...
I want to extract only the header information and output it like this (a .csv file, with each header value as its own column):
WARC-Type(from first header)\tContent-Length(from first header)\tWARC-Date(from first header)\tWARC-Payload-Digest(from first header)\tWARC-Target-URI(from first header)\tContent-Type(from first header)\tWARC-Record-ID
WARC-Type(from second header)\tContent-Length(from second header)\tWARC-Date(from second header)\tWARC-Payload-Digest(from second header)\tWARC-Target-URI(from second header)\tContent-Type(from second header)\tWARC-Record-ID
I wrote a regular expression that matches this header:
REGULAR_EXPRESSION='WARC\/1\.0\nWARC-Type\:.*\nWARC-Date\:.*\nWARC-Payload-Digest:.*\nWARC-Target-URI:.*\nWARC-Record-ID:.*\n\n'
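I cannot use grep with the -P option, so I don't know how to continue. Maybe sed? The next problem comes once the regular expression matches: how do I extract the appropriate information?
What is the best way to achieve this?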
Answer 0 (score: 1)
This is easier to handle with awk:
awk -F ': ' -v OFS='\t' 'NF>=2 {
printf "%s%s", $2, ($1 != "WARC-Record-ID" ? OFS : ORS)}' file
response 2597724 2016-05-07T03:36:46Z sha1:33a3973a118293e4f8831449cc37095d645a57b3 url application/http; msgtype=response <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>
response 2106841 2016-05-07T03:36:51Z sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e url application/http; msgtype=response <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>
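One caveat with printing every line that contains `: ` is that lines in the HTML payload can contain that separator too. A minimal sketch of a stricter variant follows. It assumes, as in the sample, that each header block starts with a `WARC/<version>` line and that `WARC-Record-ID` is the last header of the block. The function name `warc_headers_tsv` is hypothetical, and the script also strips a trailing carriage return because WARC files normally use CRLF line endings.

```shell
# Hypothetical helper: prints one tab-separated row per WARC record.
# Assumption: WARC-Record-ID is the last header line of each block.
warc_headers_tsv() {
  awk -F': ' -v OFS='\t' '
    { sub(/\r$/, "") }                               # drop a trailing CR, if any
    /^WARC\/[0-9.]+/ { in_hdr = 1; row = ""; next }  # version line opens a header block
    in_hdr && NF >= 2 {
      # Take everything after the first ": " so values containing ": " stay whole.
      val = substr($0, length($1) + 3)
      row = (row == "" ? val : row OFS val)
      if ($1 == "WARC-Record-ID") { print row; in_hdr = 0 }
    }
  ' "$@"
}
```

It would be called as `warc_headers_tsv file > headers.tsv`. Lines outside a header block are ignored, so payload text such as `body: text` no longer ends up in the output.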
Answer 1 (score: 0)
An awk solution:
awk -F': ' '/WARC-Type/{n=NR+6}NR<=n{ s="\t"; if(NR==n){n=0;s=ORS} printf "%s%s",$2,s }' file
Output:
response 2597724 2016-05-07T03:36:46Z sha1:33a3973a118293e4f8831449cc37095d645a57b3 url application/http; msgtype=response <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>
response 2106841 2016-05-07T03:36:51Z sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e url application/http; msgtype=response <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>
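Answer 2 (score: 0)
Your question isn't very clear, and your example doesn't tell us anything that would help us avoid false matches (the hardest part of any matching script), but is this what you are trying to do?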
$ awk -v RS= -v FS='\n[^:]+: *' -v OFS='\t' 'sub(/^WARC\/[0-9.]+/,""){$1=$1; sub(OFS,""); print}' file
response 2597724 2016-05-07T03:36:46Z sha1:33a3973a118293e4f8831449cc37095d645a57b3 url application/http; msgtype=response <urn:uuid:ecc531d0-1404-11e6-a7dc-002590c8c43c>
response 2106841 2016-05-07T03:36:51Z sha1:826fcc2ef666e2cfbcff9e4329a293141077a20e url application/http; msgtype=response <urn:uuid:efa655dc-1404-11e6-a7dc-002590c8c43c>
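All of the commands above rely on the headers appearing in the same order in every record. If that cannot be guaranteed, here is a rough sketch that looks the fields up by name instead and prints a header row first. The column list is taken from the question, and the function name `warc_headers_by_name` is hypothetical. It assumes a header block ends at the first blank line or at the first line without a colon, and it does not use Content-Length to skip the payload, so a payload line consisting only of `WARC/1.0` would still be mistaken for a new record.

```shell
# Hypothetical helper: one row per record, columns selected by header name.
warc_headers_by_name() {
  awk -v OFS='\t' '
    BEGIN {
      # Column order taken from the question; edit the list as needed.
      n = split("WARC-Type Content-Length WARC-Date WARC-Payload-Digest WARC-Target-URI Content-Type WARC-Record-ID", cols, " ")
      for (i = 1; i <= n; i++) printf "%s%s", cols[i], (i < n ? OFS : ORS)
    }
    function emit_row(   i) {
      if (!have) return
      for (i = 1; i <= n; i++) printf "%s%s", val[cols[i]], (i < n ? OFS : ORS)
      split("", val); have = 0                 # clear the stored values
    }
    { sub(/\r$/, "") }                         # drop a trailing CR, if any
    /^WARC\/[0-9.]+$/ { emit_row(); in_hdr = 1; have = 1; next }
    in_hdr && $0 == "" { in_hdr = 0; next }    # blank line ends the header block
    in_hdr {
      p = index($0, ":")
      if (p == 0) { in_hdr = 0; next }         # no colon: the payload has started
      key = substr($0, 1, p - 1)
      v = substr($0, p + 1); sub(/^[ \t]+/, "", v)
      val[key] = v
    }
    END { emit_row() }
  ' "$@"
}
```

Run it as `warc_headers_by_name file > headers.tsv`. A header that is missing from a record leaves an empty column rather than shifting the remaining values to the left.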