Question

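I am currently working with large data sets that contain file information formatted as chunks of data. I am trying to take a piece of data from the file path line and add it as a new column to certain lines. The data set contains file information in the following format: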

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10
7a:8b:8e:20:7b:38               1982                    10
b9:45:3d:f4:97:88               1849                    10
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10
19:c5:b2:aa:b3:60               613                     10
11:7c:7e:76:4b:d5               1272                    10
36:e0:59:49:b6:4a               581                     10
9c:31:bc:8a:39:94               3296                    10
01:f0:56:3a:e1:a9               1140                    10
Whole File Hash: 4b28b44ae03d

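What I want to do is take the file type (.jar and .c in this example) and append it to the respective Chunk Hash lines, so that the final format looks like this: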

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

I already have awk code to extract the file type and the chunk hash lines:

awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}'

awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag'

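I am just not sure how to connect these pieces with awk (or sed) so that the file type is appended as a new column to every line in its respective data chunk. One more thing to note: I am trying to do this in a bash script, in case that makes a difference.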

Answer 1

A solution in the TXR language:

@(repeat)
@  (cases)
File path: @*path.@suff
Inode Num: @inode
@header
@    (collect)
@hashline
@    (last)
Whole File Hash: @wfh
@    (end)
@    (output)
File path: @path.@suff
Inode Num: @inode
@header
@      (repeat)
@{hashline 88}.@suff
@      (end)
Whole File Hash: @wfh
@    (end)
@  (or)
@other
@  (do (put-line other))
@  (end)
@(end)

Run:

$ txr suffixes.txr data
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

Answer 2

Here is a (GNU) sed solution:

/File path:/ {         # If line matches "File path:"
    h                  # Copy pattern space to hold space
    s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space
    x                  # Swap pattern space and hold space
}                      # Hold space now contains extension
/Chunk Hash/ {         # If line matches "Chunk Hash"
    n                  # Get next line into pattern space
    :loop              # Anchor for loop
    /Whole File Hash/b # If line matches "Whole File Hash", jump out of loop
    G                  # Append extension from hold space to pattern space
    s/\n/\t\t\t\t/     # Substitute newline with a bunch of tabs
    n                  # Get next line
    b loop             # Jump back to ":loop" label
}

This can be stored in a separate file (for example, so.sed) and must then be called like this:

sed -r -f so.sed infile

resulting in

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

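Non-GNU seds have to jump through the usual hoops to insert the tabs, and cannot use the -r option (but probably -E, which should do the same here; -r is used only for convenience, so that the parentheses of the capture group need no escaping).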
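
For seds without -r, one possible portable variant is sketched below. It assumes a POSIX shell and sed, uses basic regular expressions (escaped `\(` and `\)`), and splices literal tabs in from a shell variable instead of relying on `\t`. It also replaces the explicit loop with an address range, which is a different technique from the script above. The function name `append_ext_sed` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the answers above).
# Appends the file extension to every chunk line using only POSIX sed features.
append_ext_sed() {
    tab=$(printf '\t')    # a literal tab character, since \t is not portable in sed
    sed -e '/^File path:/{
h
s/.*\(\.[^.]*\)$/\1/
x
}' -e '/^Chunk Hash/,/^Whole File Hash/{
/^Chunk Hash/b
/^Whole File Hash/b
G
s/\n/'"$tab$tab$tab$tab"'/
}' "$@"
}
```

Called as `append_ext_sed infile`, this should behave like the GNU script for input shaped like the sample: the first block leaves the extension in the hold space, and the range block skips the header and the Whole File Hash line and appends the extension to everything in between.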

Answer 3

In awk:

$ cat script.awk
/File path/ { 
    match($0,/\.[^.]*$/)   # last dot to end of line, so dots in directory names are ignored
    ext=substr($0,RSTART,RLENGTH)
} 
/Chunk Hash/ {
    flag=1            # flag on
    print             # print here to...
    next              # avoid printing ext
} 
/Whole File Hash:/ {  
    flag=0            # flag off
} 
flag==1 {
    print $0, ext     # add space here to your liking, left it short...
    next              # ... to show output on screen without sidescrolling
} 1                   # print non-flagged records

Run:

$ awk -f script.awk data.txt
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10 .jar
7a:8b:8e:20:7b:38               1982                    10 .jar
b9:45:3d:f4:97:88               1849                    10 .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10 .c
19:c5:b2:aa:b3:60               613                     10 .c
11:7c:7e:76:4b:d5               1272                    10 .c
36:e0:59:49:b6:4a               581                     10 .c
9c:31:bc:8a:39:94               3296                    10 .c
01:f0:56:3a:e1:a9               1140                    10 .c
Whole File Hash: 4b28b44ae03d
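
Since the question mentions a bash script, the same idea can be wrapped in a shell function. The sketch below is one way to do that, with the spacing handled by `printf` so the extension lands in a fixed column. The width of 88 is borrowed from the TXR answer and is an assumption about the desired layout; the function name `append_ext_awk` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): pads each chunk line to a
# fixed width and appends the extension taken from the last "File path:" line.
append_ext_awk() {
    awk '
    /^File path:/ {
        ext = ""
        # Extension = last dot to end of line, with no "/" after the dot.
        if (match($0, "\\.[^./]*$")) ext = substr($0, RSTART)
    }
    /^Whole File Hash/ { flag = 0 }                            # chunk list ends
    flag               { printf "%-88s%s\n", $0, ext; next }   # chunk line
    /^Chunk Hash/      { flag = 1 }                            # chunk list starts after header
    { print }                                                  # everything else unchanged
    ' "$@"
}
```

Inside a script this can be used as `append_ext_awk data.txt > data.with_ext.txt`, or at the end of a pipeline, since awk reads standard input when no file is given.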

Answer 4

awk --re-interval '
/^File/{                                 # line starts with "File" (the file path line)
    s=gensub("[^.]+(.*)","\\1","1",$0);  # keep everything from the first dot on (e.g. ".c", ".jar") in s
} 
/(..:){3,}/{                             # line contains three or more "xx:" groups (a chunk hash line)
    gsub("[0-9]+$","&\t\t\t\t\t" s,$0)   # append tabs and s after the trailing number
}
1' file                                  # print every line
