Question

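I have a file to import into a database table, but I want each row to hold one part of that file. For the import, I need to specify the offset (first byte) and the length (number of bytes) for each row.

I have the following files: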

*line_numbers.txt* -> Each row contains the number of 
                      the last row of a record in *plans.txt*.

*plans.txt* ->  All the information required for all the rows.

I have the following code:

#Starting line number of the record
sLine=0

#Starting byte value of the record
offSet=0

while read line
do
    endByte=`awk -v fline=${sLine} -v lline=${line} \
                 '{if (NR > fline && NR < lline) \
                      sum += length($0); } \
                 END {print sum}' plans.txt`
    echo "\"plans.txt.${offSet}.${endByte}/\"" >> lobs.in
    sLine=$((line+1))
    offSet=$((endByte+offSet))
done < line_numbers.txt

This code writes something like the following to the file lobs.in:

"plans.txt.0.504/"
"plans.txt.505.480/"
"plans.txt.984.480/"
"plans.txt.1464.1159/"
"plans.txt.2623.515/"

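This means, for example, that the first record starts at byte 0 and continues for the next 504 bytes. The next one starts at byte 505 and continues for the next 480 bytes.

I still need to run more tests, but it seems to be working. My problem is that it is very slow for the volume I need to process.

Do you have any performance tips?

I looked for a way to put the loop inside awk, but I need two input files and I don't know how to handle that without the while loop.

Thanks!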

Answer 1

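It will be faster to do this entirely in awk, because the original loop starts a new awk process and rereads plans.txt once for every record.

Suppose you have: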

$ cat lines.txt
100
200
300
360
10000
50000

and

$ awk -v maxl=50000 'BEGIN{for (i=1;i<=maxl;i++) printf "Line %d\n", i}' >data.txt

(so the file data.txt contains Line 1\nLine 2\n...Line maxl)

You can do something like this:

awk 'FNR==NR{lines[FNR]=$1; next}
            {data[FNR]=length($0); next}
     END{ sl=1
          for (i=1; i in lines; i++) {
               bc=0
               for (j=sl; j<=lines[i]; j++){
                   bc+=data[j]
               }
               printf "line %d to %d is %d bytes\n", sl, j-1, bc
               sl=lines[i]+1
          }    
}' lines.txt data.txt
line 1 to 100 is 692 bytes
line 101 to 200 is 800 bytes
line 201 to 300 is 800 bytes
line 301 to 360 is 480 bytes
line 361 to 10000 is 86122 bytes
line 10001 to 50000 is 400000 bytes
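
The same idea can be adapted to emit the lobs.in lines from the question directly. The following is only a sketch under some assumptions that are not stated in the original post: offsets are 0-based, each record's length includes the newline byte of every line, and records are contiguous. The function name record_offsets is hypothetical. LC_ALL=C is set so that length() counts bytes rather than multibyte characters.

```shell
#!/bin/sh
# Sketch: a single awk pass over both files that prints one
# "name.offset.length/" line per record.
# Assumptions (not from the original post): 0-based offsets, newline
# bytes counted in each length, records contiguous.
# record_offsets is a hypothetical helper name.
#   $1 = file listing the last line number of each record
#   $2 = data file
record_offsets() {
    LC_ALL=C awk -v name="$2" '
        BEGIN { r = 1; off = 0; len = 0 }
        # First file: remember the last line number of each record.
        FNR == NR { last[++n] = $1; next }
        # Second file: accumulate bytes until the current record ends.
        {
            len += length($0) + 1          # +1 for the newline byte
            if (r <= n && FNR == last[r]) {
                printf "\"%s.%d.%d/\"\n", name, off, len
                off += len                 # next record starts here
                len = 0
                r++
            }
        }
    ' "$1" "$2"
}
```

It could be called as record_offsets line_numbers.txt plans.txt > lobs.in. Unlike the version above, this variant keeps only the list of line numbers in memory and reads the data file as a stream, which may matter for large inputs. If the loader expects lengths without the trailing newlines, the "+ 1" would need to be adjusted accordingly.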

Answer 2

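A simple improvement: never redirect with >> inside a loop when you can redirect with >> outside the loop. Redirecting inside the loop reopens the output file on every iteration. Worse: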

while read line
do
    # .... stuff omitted ... 
    echo "\"plans.txt.${offSet}.${endByte}/\"" >> lobs.in
    # ....
done < line_numbers.txt

Note that the only line in the loop that outputs anything is the echo. Better:

while read line
do
    # .... stuff omitted ... 
    echo "\"plans.txt.${offSet}.${endByte}/\""
    # ....
done < line_numbers.txt >> lobs.in
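
To check that moving the redirection changes only how often the file is opened and not what gets written, here is a small self-contained sketch. The function names append_inside and append_outside and the "record" text are hypothetical and only stand in for the real loop body.

```shell
#!/bin/sh
# Hypothetical helpers comparing the two redirection styles.
#   $1 = input file (one value per line), $2 = output file

# Reopens the output file on every iteration.
append_inside() {
    while read -r line; do
        echo "record ${line}" >> "$2"
    done < "$1"
}

# Opens the output file once for the whole loop.
append_outside() {
    while read -r line; do
        echo "record ${line}"
    done < "$1" >> "$2"
}
```

Running each function against the same input and comparing the two output files with cmp should show no difference. Prefixing each call with time gives a rough idea of the saving on a given system.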

Performance: using an AWK loop
