I'm writing a bash script to do some text processing on a file. Part of the file contains a grid of values that has to be reformatted so that it has the correct number of columns.
Here's a sample of the grid, which in this case must be formatted into 16 columns:
702.0 697.0 687.0 685.0 693.0 700.0 693.0 681.0 676.0 684.0
694.0 700.0 704.0 710.0 710.0 710.0
711.0 704.0 697.0 690.0 693.0 699.5 696.0 692.0 680.0 687.0
696.0 705.0 709.0 714.0 716.0 714.0
722.0 711.0 708.0 700.0 696.0 703.0 701.0 692.0 678.0 684.0
695.0 707.0 712.0 713.0 716.0 717.0
727.0 718.0 712.0 707.0 705.0 706.5 701.0 692.0 680.0 683.0
693.0 706.0 714.0 718.0 720.0 718.0
732.0 728.0 725.0 718.0 715.0 708.0 699.0 693.0 683.0 681.0
694.0 703.0 711.0 715.0 723.0 727.0
738.0 735.0 732.0 721.0 723.0 712.0 702.0 696.0 690.0 681.0
693.0 701.0 709.0 712.0 720.0 726.0
736.5 736.5 734.0 728.0 726.5 718.8 714.5 707.5 701.0 687.0
684.5 695.5 703.0 708.0 716.0 721.5
736.0 734.0 727.0 726.0 723.0 720.0 723.0 713.0 708.0 699.0
678.0 686.0 696.0 706.0 712.0 714.0
729.0 726.0 717.0 716.0 715.0 717.0 720.0 714.0 710.0 700.0
678.0 679.0 689.0 700.0 702.0 708.0
722.0 719.0 713.0 709.0 705.0 711.0 719.0 716.0 706.0 697.0
680.0 679.0 682.0 694.0 698.0 702.0
712.0 713.0 707.0 704.0 697.0 708.5 719.0 715.0 705.0 693.0
678.0 680.0 682.0 683.0 685.0 691.0
707.0 706.0 702.0 693.0 699.0 710.5 712.0 707.0 701.0 687.0
677.0 687.0 686.0 686.0 680.0 682.0
Here's my script so far, which does not behave as expected:
#!/bin/bash
Target=${1:-"grid.dat"}
Outfile="grid.new.dat"
ColumnCount="16"
RawGrid=()
while read line; do
    RawGrid+=($line)
done < <(cat ${Target})
echo "${#RawGrid[@]} cells found!"
echo "" > ${Outfile}
for (( i=0; i < ${#RawGrid[@]}; i+=1 )); do
    echo -n " ${RawGrid[$i]}" >> $Outfile
    ((i % ${ColumnCount} == 0)) && (( i > 0 )) && echo "" >> $Outfile # New row
done
The part I'm particularly stuck on is printing the grid with the correct number of columns. Maybe I'm not populating the array correctly?
Answer (score: 0)
Here's how I would tackle this:
#!/bin/bash
unset IFS # reset internal field separator
Target=${1:-"grid.dat"}
Outfile="grid.new.dat"
ColumnCount="16"
Separator=" "
Counter=0 # total cell count
while read line; do
    # Default IFS is space, so this "just works"
    for field in $line; do
        Counter=$((Counter+1))
        printf "$field" >> $Outfile
        # If the counter is divisible by 16, insert a newline; otherwise
        # insert the separator
        if [ $((Counter%ColumnCount)) -eq 0 ]; then
            printf "\n" >> $Outfile
        else
            printf "$Separator" >> $Outfile
        fi
    done
done < ${Target} # no need for cat here
echo "${Counter} cells found!"
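As an aside, the array in the question is populated fine; the problem is an off-by-one in the newline test. `i % 16 == 0 && i > 0` is first true at i=16, i.e. after the 17th cell has already been printed, so the first row comes out one cell too wide. A minimal sketch of the corrected loop (with stand-in data so it runs on its own):

```shell
#!/bin/bash
ColumnCount=16
RawGrid=($(seq 1 32))   # stand-in data: 32 cells, should yield two rows of 16
for (( i=0; i < ${#RawGrid[@]}; i++ )); do
    printf ' %s' "${RawGrid[$i]}"
    # Test (i+1): the newline lands after every 16th cell, not the 17th
    if (( (i+1) % ColumnCount == 0 )); then
        printf '\n'
    fi
done
```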
It's a touch slower on my MBP than the script in the question with the original data set:
$ command time -l ./original.sh
192 cells found!
0.03 real 0.01 user 0.00 sys
2596864 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1815 page reclaims
0 page faults
0 swaps
0 block input operations
3 block output operations
0 messages sent
0 messages received
1 signals received
2 voluntary context switches
126 involuntary context switches
$ command time -l ./new.sh
192 cells found!
0.04 real 0.01 user 0.01 sys
2555904 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
644 page reclaims
0 page faults
0 swaps
0 block input operations
1 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
333 involuntary context switches
However, since it only keeps the latest line, the current field, and a counter in memory, it can now handle an unlimited number of rows. Let's try a much bigger data set...
$ for i in {1..1000}; do cat grid.dat >> monster-grid.dat; done
$ command time -l ./new.sh monster-grid.dat
192000 cells found!
28.88 real 11.38 user 10.02 sys
2936832 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
737 page reclaims
0 page faults
0 swaps
0 block input operations
5 block output operations
0 messages sent
0 messages received
0 signals received
2 voluntary context switches
356677 involuntary context switches
$ command time -l ./original.sh monster-grid.dat
192000 cells found!
266.32 real 222.08 user 13.82 sys
12320768 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
4222 page reclaims
0 page faults
0 swaps
0 block input operations
23 block output operations
0 messages sent
0 messages received
1 signals received
179607 voluntary context switches
295379 involuntary context switches
We can now see that performance is poor in both cases, but the new script does save some memory, and on larger data sets it is actually faster than the original. However, as @Sundeep suggested, pr is by far the best answer:
$ command time -l grep -o '[0-9.]\+' monster-grid.dat | pr -16ats > grid.new.dat
0.17 real 0.17 user 0.00 sys
2220032 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
562 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
198 involuntary context switches
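If pr isn't available, or its separator handling differs across platforms, POSIX xargs gives a similarly compact, stream-friendly alternative. This is just a sketch, and it assumes the grid contains only whitespace-separated numeric fields (xargs would mangle tokens containing quotes or backslashes):

```shell
# Re-flow whitespace-separated tokens into rows of 16, one row per line.
# xargs reads tokens regardless of the original line breaks and echoes
# them back 16 at a time.
xargs -n 16 < grid.dat > grid.new.dat
```

Like pr, this streams the input rather than loading it all into a shell array, so memory use stays flat no matter how large the grid grows.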