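I am currently working with large datasets that contain file information formatted in blocks. I am trying to take a piece of data from the "File path" line and append it as a new column to certain lines. The datasets contain file information in the following format: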
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10
7a:8b:8e:20:7b:38 1982 10
b9:45:3d:f4:97:88 1849 10
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10
19:c5:b2:aa:b3:60 613 10
11:7c:7e:76:4b:d5 1272 10
36:e0:59:49:b6:4a 581 10
9c:31:bc:8a:39:94 3296 10
01:f0:56:3a:e1:a9 1140 10
Whole File Hash: 4b28b44ae03d
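What I would like to do is take the file type (in this example, .jar and .c) and append it to each of the corresponding chunk hash lines, so that the final format looks like this: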
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
I already have awk code to extract the file type and the chunk hash lines:
awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}'
awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag'
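I am just not sure how to connect these pieces with awk (or sed) so that the file type is appended as a new column to every line in its respective block. One more thing to note: I am trying to do this inside a bash script, in case that makes a difference.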
Answer 0 (score: 2)
A solution in the TXR language:
@(repeat)
@ (cases)
File path: @*path.@suff
Inode Num: @inode
@header
@ (collect)
@hashline
@ (last)
Whole File Hash: @wfh
@ (end)
@ (output)
File path: @path.@suff
Inode Num: @inode
@header
@ (repeat)
@{hashline 88}.@suff
@ (end)
Whole File Hash: @wfh
@ (end)
@ (or)
@other
@ (do (put-line other))
@ (end)
@(end)
Run:
$ txr suffixes.txr data
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
Answer 1 (score: 2)
Here is a (GNU) sed solution:
/File path:/ {              # If line matches "File path:"
    h                       # Copy pattern space to hold space
    s/.*(\.[^.]*)$/\1/      # Remove everything but extension from pattern space
    x                       # Swap pattern space and hold space
}                           # Hold space now contains extension
/Chunk Hash/ {              # If line matches "Chunk Hash"
    n                       # Get next line into pattern space
    :loop                   # Anchor for loop
    /Whole File Hash/b      # If line matches "Whole File Hash", jump out of loop
    G                       # Append extension from hold space to pattern space
    s/\n/\t\t\t\t/          # Substitute newline with a bunch of tabs
    n                       # Get next line
    b loop                  # Jump back to ":loop" label
}
This can be stored in a separate file (for example, so.sed) and must be called like
sed -r -f so.sed infile
resulting in
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
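Non-GNU seds have to jump through the usual hoops to insert tabs, and they cannot use the -r option (though they may accept -E, which should behave the same here; -r is used only for convenience).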
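For seds without -r, a range address can stand in for the explicit loop. The following is a minimal sketch rather than the script from this answer: the function name append_ext_sed is hypothetical, a single space replaces the tabs to sidestep the tab-insertion problem, and it assumes that every file path ends in an extension and that the -E option is available (as it is in GNU and BSD sed).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer).
# Assumes every "File path:" line ends in ".ext" and that sed accepts -E.
append_ext_sed() {
  sed -E '
    /^File path:/ {
      h
      s/.*(\.[^.]*)$/\1/
      x
    }
    /^Chunk Hash/,/^Whole File Hash:/ {
      /^Chunk Hash/b
      /^Whole File Hash:/b
      G
      s/\n/ /
    }
  ' "$@"
}
```

The first block leaves the extension in the hold space, exactly as in the script above. Inside the range, the two branch commands skip the header and the closing hash line, so only the chunk lines reach G and get the extension appended. It could be called as append_ext_sed infile.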
Answer 2 (score: 0)
In awk:
$ cat script.awk
/File path/ {
    match($0,/\..+/)
    ext=substr($0,RSTART,RLENGTH)
}
/Chunk Hash/ {
    flag=1            # flag on
    print             # print here to...
    next              # avoid printing ext
}
/Whole File Hash:/ {
    flag=0            # flag off
}
flag==1 {
    print $0, ext     # add space here to your liking, left it short...
    next              # ... to show output on screen without sidescrolling
} 1                   # print non-flagged records
Run:
$ awk -f script.awk data.txt
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
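Since the question mentions a bash script, the same flag-based logic can be wrapped in a shell function. This is only a sketch under a few assumptions: the name append_ext is hypothetical, the patterns are anchored at the start of the line, and the extension is taken to be the last dot-suffix of the path. With that choice, dots in directory names are ignored and a file without an extension gets nothing appended.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer).
# Reads the files given as arguments, or stdin if there are none.
append_ext() {
  awk '
    /^File path:/ {
      # Keep only the final ".xyz" suffix; leave ext empty if there is none.
      ext = ""
      if (match($0, "\\.[^./]+$")) ext = substr($0, RSTART)
    }
    /^Chunk Hash/       { in_chunks = 1; print; next }   # header: print unchanged
    /^Whole File Hash:/ { in_chunks = 0 }                # block ends here
    in_chunks           { print (ext == "" ? $0 : $0 " " ext); next }
    { print }                                            # all other lines unchanged
  ' "$@"
}
```

Inside a script this could be used as append_ext data.txt > data_with_ext.txt, or at the end of a pipeline, since awk falls back to standard input when no file is given.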
Answer 3 (score: 0)
awk --re-interval '
/^File/{                                  # If the line begins with "File"
    s=gensub("[^.]+(.*)","\\1","1",$0);   # Extract the extension (e.g. ".c", ".jar") and store it in s
}
/(..:){3,}/{                              # If the line contains "xx:" three or more times in a row
    gsub("[0-9]+$","&\t\t\t\t\t" s,$0)    # Append s after the trailing number
}
1' file                                   # Print every line
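gensub and --re-interval are specific to gawk. As a rough sketch of the same idea (recognising chunk lines by their shape instead of tracking a flag) that avoids both, the following uses only sub and a plain regular expression. The function name append_ext_by_pattern is hypothetical, and the sketch assumes that every file path contains an extension and that chunk hashes are lowercase hex pairs separated by colons.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer).
# Assumes every path has an extension and hashes look like "45:97:2a:...".
append_ext_by_pattern() {
  awk '
    /^File path:/ {
      ext = $0
      sub(/.*\./, ".", ext)        # strip everything up to the last dot
    }
    # A chunk line starts with colon-separated hex pairs followed by a blank.
    /^([0-9a-f][0-9a-f]:)+[0-9a-f][0-9a-f][ \t]/ {
      $0 = $0 " " ext
    }
    { print }
  ' "$@"
}
```

Because the chunk lines are identified by their own format, this variant does not depend on the "Chunk Hash" header or the "Whole File Hash" line being present.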