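Split one file into multiple files based on a pattern

Time: 2011-11-09 07:05:51

Tags: bash sed split awk

I have a binary file that I converted into a regular file using hexdump and a few awk and sed commands. The output file looks something like this: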

$cat temp
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e5820000000000000000000
000000087d3f513000000000000000000000000000000000001001001010f000000000026 
58783100b354c52658783100b43d3d0000ad6413400103231665f301010b9130194899f2f
fffffffffff02007c00dc015800a040402802f1d5b2b8ca5674504f433031000000000004
6363070000000000000000000000000065450000b4fb6b4000393d3d1116cdcc57e58287d
3f55285a1084b

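The temp file has a few eye catchers (3d3d) that do not repeat very often. They more or less mark the start of a new binary record. I need to split the file on those eye catchers.

My desired output is multiple files (as many as there are eye catchers in my temp file).

So my output would look something like this: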

$cat temp1
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e582000000000000000
0000000000087d3f513000000000000000000000000000000000001001001010f00000000
002658783100b354c52658783100b4

$cat temp2
3d3d0000ad6413400103231665f301010b9130194899f2ffffffffffff02007c00dc0
15800a040402802f1d5b2b8ca5674504f4330310000000000046363070000000000000000
000000000065450000b4fb6b400039

$cat temp3
3d3d1116cdcc57e58287d3f55285a1084b

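5 answers:

Answer 0 (score: 18)

The RS variable in awk is nice for this, since it lets you define the record separator. You then just need to capture each record in its own temp file. The simplest version is: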

cat temp |
  awk -v RS="3d3d" '{ print $0 > "temp" NR }' 

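The sample text starts with the eye catcher 3d3d, so temp1 will be an empty file. Also, the eye catcher itself won't appear at the start of each temp file, as it does in the temp files shown in the question. Finally, if there are a lot of records, you could run into the system limit on open files. A few minor complications bring it closer to what you want and make it safer: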

cat temp |
  awk -v RS="3d3d" 'NR > 1 { print RS $0 > "temp" (NR-1); close("temp" (NR-1)) }' 
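
A multi-character RS is an extension: gawk and mawk treat it as a regular expression, but POSIX leaves the behavior unspecified and some awk implementations use only the first character. Where that is a concern, a sketch along the following lines avoids RS altogether by scanning each line with index(). The pat variable, the emit helper, the scratch directory, and the sample data are illustrative assumptions, not part of the original answer. It also assumes that a marker never straddles a line break, and it drops any text that precedes the first marker.

```shell
# Scratch directory and sample data are made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '3d3daaaa\nbbbb3d3dcccc3d3ddd\n' > temp

awk -v pat='3d3d' '
# Append s to the current piece; text before the first marker is dropped.
function emit(s) { if (out != "") printf "%s", s > out }
{
    line = $0
    # Cut the line at every marker it contains.
    while ((p = index(line, pat)) > 0) {
        emit(substr(line, 1, p - 1))      # tail of the previous piece
        if (out != "") close(out)         # stay under the open-file limit
        out = "temp" (++n)                # temp1, temp2, ...
        emit(pat)                         # each piece starts with the marker
        line = substr(line, p + length(pat))
    }
    emit(line "\n")                       # rest of the line, newline restored
}' temp

# Sanity check: the pieces concatenate back to the input.
cat temp1 temp2 temp3 | cmp -s - temp && echo "pieces reassemble to the original"
```

Because line breaks are written back unchanged, the pieces keep the wrapping of the input; pipe the input through tr -d '\n' first if that is not wanted.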

Answer 1 (score: 14)

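#!/usr/bin/perl
use strict;
use warnings;

# Slurp all input (file arguments or stdin) into one string.
local $/;
my $data = <>;
my $n = 0;

# Split before every 3d3d; the zero-width lookahead keeps the marker in each piece.
for my $piece (split /(?=3d3d)/, $data) {
    my $name = 'temp' . ++$n;
    open(my $out, '>', $name) or die "cannot open $name: $!";
    print {$out} $piece;
    close($out) or die "cannot close $name: $!";
}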

Answer 2 (score: 5)

This might work:

# sed 's/3d3d/\n&/2g' temp | split -dl1 - temp
# ls
temp temp00  temp01  temp02
# cat temp00
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e5820000000000000000000000000087d3f513000000000000000000000000000000000001001001010f000000000026 58783100b354c52658783100b4
# cat temp01
3d3d0000ad6413400103231665f301010b9130194899f2ffffffffffff02007c00dc015800a040402802f1d5b2b8ca5674504f4330310000000000046363070000000000000000000000000065450000b4fb6b400039
# cat temp02
3d3d1116cdcc57e58287d3f55285a1084b

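Edit:

If the source file contains newlines, you can remove them first with tr -d '\n' <temp and then pipe the output through the sed command above. If you want to keep them, then: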

 sed 's/3d3d/\n&/g;s/^\n\(3d3d\)/\1/' temp |csplit -zf temp - '/^3d3d/' {*}

should do the trick.
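
The newline-stripping variant mentioned in the edit could be assembled roughly as follows. This is a minimal sketch that assumes GNU sed (for \n in the replacement and the combined 2g flag) and GNU split (for -d); the scratch directory and the three-record sample are made up for illustration.

```shell
# Scratch directory and sample data are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '3d3daaaa\nbbbb3d3dcccc3d3ddd\n' > temp

# 1. tr drops the line breaks left over from the hex dump.
# 2. sed (GNU) starts a new line before the 2nd and every later 3d3d.
# 3. split (GNU) writes one line per file: temp00, temp01, ...
tr -d '\n' < temp | sed 's/3d3d/\n&/2g' | split -d -l 1 - temp

ls temp0*
```

Each output file then holds exactly one record on a single line, at the cost of losing the original line wrapping.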

Answer 3 (score: 0)

Mac OS X answer

For where that nice awk -v RS="pattern" trick doesn't work. Here's what I got working:

Given this example concatted.txt:

filename=foo bar
foo bar line1
foo bar line2
filename=baz qux
baz qux line1
baz qux line2

Use this command (remove the comments to keep it from failing):

# cat: useless use of cat ^__^;
# tr: replace all newlines with delimiter1 (which must not be in concatted.txt) so we have one line of all the text
# sed: replace file start pattern with delimiter2 (which must not be in concatted.txt) so we know where to split out each file
# tr: replace delimiter2 with NULL character since sed can't do it
# xargs: split giant single-line input on NULL character and pass 1 line (= 1 file) at a time to echo into the pipe
# sed: get all but last line (same as head -n -1) because there's an extra since concatted-file.txt ends in a NULL character.
# awk: does a bunch of stuff as the final command. Remember it's getting a single line to work with.
#   {replace all delimiter1s in file with newlines (in place)}
#   {match regex (sets RSTART and RLENGTH) then set filename to regex match (might end at delimiter1). Note in this case the number 9 is the length of "filename=" and the 2 removes the "§" }
#   {write file to filename and close the file (to avoid "too many files open" error)}
cat ../concatted-file.txt \
| tr '\n' '§' \
| sed 's/filename=/∂filename=/g' \
| tr '∂' '\0' \
| xargs -t -0 -n1 echo \
| sed \$d \
| awk '{match($0, /filename=[^§]+§/)} {filename=substr($0, RSTART+9, RLENGTH-9-2)".txt"} {gsub(/§/, "\n", $0)} {print $0 > filename; close(filename)}'

This yields two files named foo bar.txt and baz qux.txt, respectively:

filename=foo bar
foo bar line1
foo bar line2

Hope this helps!
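
Since every file in this example begins on its own line with filename=, a line-oriented awk program may be enough and does not depend on a multi-character RS. The following is only a sketch under that assumption; it also assumes the text after filename= is safe to use as a file name. The scratch directory and the out variable are illustrative, not part of the answer above.

```shell
# Scratch directory is an illustrative assumption; the sample mirrors concatted.txt above.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'filename=foo bar' 'foo bar line1' 'foo bar line2' \
              'filename=baz qux' 'baz qux line1' 'baz qux line2' > concatted.txt

awk '
/^filename=/ {
    if (out != "") close(out)            # finish the previous file
    out = substr($0, 10) ".txt"          # text after "filename=" (9 characters)
}
out != "" { print > out }                # header line and body go to the current file
' concatted.txt

ls
```

Like the pipeline above, this keeps the filename= header line inside each output file.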

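Answer 4 (score: -1)

This depends on whether the temp file consists of a single line. Assuming it does, you can go with:

sed 's/\(.\)\(3d3d\)/\1#\2/g' FILE | awk -F "#" '{ for (i=1; i<=NF; i++) { print $i > ("temp" i) } }'

The sed command first inserts a # before every 3d3d except one at the very start of the line, to act as a field/record separator; awk then splits on # and prints each "field" to its own file.

If the input file is already split on 3d3d, so that each record starts on its own line, then you can use:

awk '/^3d3d/ { i++ } { print > ("temp" i) }' temp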

HTH