I'm writing a bash script to do some text processing on a file. Part of the file contains a grid of values that has to be reformatted so that it has the correct number of columns.
Here's a sample of the grid, which in this case must be formatted into 16 columns:
702.0 697.0 687.0 685.0 693.0 700.0 693.0 681.0 676.0 684.0
694.0 700.0 704.0 710.0 710.0 710.0
711.0 704.0 697.0 690.0 693.0 699.5 696.0 692.0 680.0 687.0
696.0 705.0 709.0 714.0 716.0 714.0
722.0 711.0 708.0 700.0 696.0 703.0 701.0 692.0 678.0 684.0
695.0 707.0 712.0 713.0 716.0 717.0
727.0 718.0 712.0 707.0 705.0 706.5 701.0 692.0 680.0 683.0
693.0 706.0 714.0 718.0 720.0 718.0
732.0 728.0 725.0 718.0 715.0 708.0 699.0 693.0 683.0 681.0
694.0 703.0 711.0 715.0 723.0 727.0
738.0 735.0 732.0 721.0 723.0 712.0 702.0 696.0 690.0 681.0
693.0 701.0 709.0 712.0 720.0 726.0
736.5 736.5 734.0 728.0 726.5 718.8 714.5 707.5 701.0 687.0
684.5 695.5 703.0 708.0 716.0 721.5
736.0 734.0 727.0 726.0 723.0 720.0 723.0 713.0 708.0 699.0
678.0 686.0 696.0 706.0 712.0 714.0
729.0 726.0 717.0 716.0 715.0 717.0 720.0 714.0 710.0 700.0
678.0 679.0 689.0 700.0 702.0 708.0
722.0 719.0 713.0 709.0 705.0 711.0 719.0 716.0 706.0 697.0
680.0 679.0 682.0 694.0 698.0 702.0
712.0 713.0 707.0 704.0 697.0 708.5 719.0 715.0 705.0 693.0
678.0 680.0 682.0 683.0 685.0 691.0
707.0 706.0 702.0 693.0 699.0 710.5 712.0 707.0 701.0 687.0
677.0 687.0 686.0 686.0 680.0 682.0
Here's my script so far, which does not behave as expected:
#!/bin/bash
Target=${1:-"grid.dat"}
Outfile="grid.new.dat"
ColumnCount="16"
RawGrid=()
while read line; do
    RawGrid+=($line)
done < <(cat ${Target})
echo "${#RawGrid[@]} cells found!"
echo "" > ${Outfile}
for (( i=0; i < ${#RawGrid[@]}; i+=1 )); do
    echo -n " ${RawGrid[$i]}" >> $Outfile
    ((i % ${ColumnCount} == 0)) && (( i > 0 )) && echo "" >> $Outfile # New row
done
The part I'm particularly stuck on is printing the grid with the correct number of columns. Maybe I'm not populating the array correctly?
Answer (score: 0)
Here's how I would tackle this:
#!/bin/bash
unset IFS # reset internal field separator
Target=${1:-"grid.dat"}
Outfile="grid.new.dat"
ColumnCount="16"
Separator=" "
Counter=0 # total cell count
while read line; do
    # Default IFS is space, so this "just works"
    for field in $line; do
        Counter=$((Counter+1))
        printf "$field" >> $Outfile
        # If the counter is divisible by 16, insert a newline; otherwise
        # insert the separator
        if [ $((Counter%ColumnCount)) -eq 0 ]; then
            printf "\n" >> $Outfile
        else
            printf "$Separator" >> $Outfile
        fi
    done
done < ${Target} # no need for cat here
echo "${Counter} cells found!"
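As an aside, the array in the question is populated fine; the problem is an off-by-one in the newline test. `i % 16 == 0 && i > 0` is first true at i=16, i.e. after the 17th cell has already been printed, so the first row comes out one cell too wide. A minimal sketch of the corrected loop (with stand-in data so it runs on its own):

```shell
#!/bin/bash
ColumnCount=16
RawGrid=($(seq 1 32))   # stand-in data: 32 cells, should yield two rows of 16
for (( i=0; i < ${#RawGrid[@]}; i++ )); do
    printf ' %s' "${RawGrid[$i]}"
    # Test (i+1): the newline lands after every 16th cell, not the 17th
    if (( (i+1) % ColumnCount == 0 )); then
        printf '\n'
    fi
done
```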
It's a touch slower on my MBP than the script in the question with the original data set:
$ command time -l ./original.sh
192 cells found!
0.03 real 0.01 user 0.00 sys
2596864 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1815 page reclaims
0 page faults
0 swaps
0 block input operations
3 block output operations
0 messages sent
0 messages received
1 signals received
2 voluntary context switches
126 involuntary context switches
$ command time -l ./new.sh
192 cells found!
0.04 real 0.01 user 0.01 sys
2555904 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
644 page reclaims
0 page faults
0 swaps
0 block input operations
1 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
333 involuntary context switches
However, since it only keeps the latest line, the current field, and a counter in memory, it can now handle an unlimited number of rows. Let's try a much bigger data set...
$ for i in {1..1000}; do cat grid.dat >> monster-grid.dat; done
$ command time -l ./new.sh monster-grid.dat
192000 cells found!
28.88 real 11.38 user 10.02 sys
2936832 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
737 page reclaims
0 page faults
0 swaps
0 block input operations
5 block output operations
0 messages sent
0 messages received
0 signals received
2 voluntary context switches
356677 involuntary context switches
$ command time -l ./original.sh monster-grid.dat
192000 cells found!
266.32 real 222.08 user 13.82 sys
12320768 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
4222 page reclaims
0 page faults
0 swaps
0 block input operations
23 block output operations
0 messages sent
0 messages received
1 signals received
179607 voluntary context switches
295379 involuntary context switches
We can now see that performance is poor in both cases, but the new script does save some memory, and on larger data sets it is actually faster than the original. However, as @Sundeep suggested, pr is by far the best answer:
$ command time -l grep -o '[0-9.]\+' monster-grid.dat | pr -16ats > grid.new.dat
0.17 real 0.17 user 0.00 sys
2220032 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
562 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
198 involuntary context switches
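If pr isn't available, or its separator handling differs across platforms, POSIX xargs gives a similarly compact, stream-friendly alternative. This is just a sketch, and it assumes the grid contains only whitespace-separated numeric fields (xargs would mangle tokens containing quotes or backslashes):

```shell
# Re-flow whitespace-separated tokens into rows of 16, one row per line.
# xargs reads tokens regardless of the original line breaks and echoes
# them back 16 at a time.
xargs -n 16 < grid.dat > grid.new.dat
```

Like pr, this streams the input rather than loading it all into a shell array, so memory use stays flat no matter how large the grid grows.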