Question

Consider a plain text file containing the page-breaking ASCII control character "Form Feed" ($'\f'):

alpha\n
beta\n
gamma\n\f
one\n
two\n
three\n
four\n
five\n\f
earth\n
wind\n
fire\n
water\n\f

Note that each page has a random number of lines.
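To reproduce the sample above, the file can be written with printf. This is only a sketch: the helper name make_sample and the path pages.txt are hypothetical and not part of the original post.

```shell
# Hypothetical helper: write the sample data shown above to the path given in $1.
make_sample() {
  printf 'alpha\nbeta\ngamma\n\fone\ntwo\nthree\nfour\nfive\n\fearth\nwind\nfire\nwater\n\f' > "$1"
}

sample="${TMPDIR:-/tmp}/pages.txt"   # assumed location, any writable path works
make_sample "$sample"
wc -l < "$sample"                    # number of newline characters
tr -cd '\f' < "$sample" | wc -c      # number of form feeds
```

The two counts should report 12 lines and 3 form feeds for this sample.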

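I need a bash routine that returns the page number of a given line number from a text file containing the page-breaking ASCII control character.

After a long time researching a solution, I finally came across this piece of code: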

function get_page_from_line
{
    local nline="$1"
    local input_file="$2"

    local npag=0
    local ln=0
    local total=0

    while IFS= read -d $'\f' -r page; do

        npag=$(( ++npag ))

        ln=$(echo -n "$page" | wc -l)

        total=$(( total + ln ))

        if [ $total -ge $nline ]; then
            echo "${npag}"
            return
        fi

    done < "$input_file"

    echo "0"

    return
}

But unfortunately, this solution proves to be very slow in some cases.

Is there a better solution?

Thanks!

Answer 1

awk to the rescue!

awk -v RS='\f' -v n=09 '$0~"^"n"." || $0~"\n"n"." {print NR}' file

3

Updated with anchoring, as shown below.

 $ for i in $(seq -w 12); do awk -v RS='\f' -v n="$i" \
          '$0~"^"n"." || $0~"\n"n"." {print n,"->",NR}' file; done

01 -> 1
02 -> 1
03 -> 1
04 -> 2
05 -> 2
06 -> 2
07 -> 2
08 -> 2
09 -> 3
10 -> 3
11 -> 3
12 -> 3
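
The command above matches a number at the start of each line. If the lines carry no such number, the same RS='\f' idea can be applied by counting the newlines in each record instead. The following is a minimal sketch under that assumption, not part of the original answer; the function name page_of_line_rs is hypothetical, every line is assumed to end with a newline, and the line number is assumed to be 1 or greater.

```shell
# Hypothetical helper: print the page that holds line $1 of file $2, or 0 if out of range.
page_of_line_rs() {
  awk -v RS='\f' -v n="$1" '
    {
      # gsub returns the number of newlines replaced, i.e. the lines on this page
      total += gsub(/\n/, "\n")
      if (total >= n) { print NR; found = 1; exit }
    }
    END { if (!found) print 0 }
  ' "$2"
}
```

Called as page_of_line_rs 9 file, it stops reading as soon as the running total reaches the requested line.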

Answer 2

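A script of similar length can be written in bash itself to locate and respond to the <form-feed> characters embedded in the file. (It would also work in a POSIX shell, substituting for the string indexing and using expr for the math.) For example,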

#!/bin/bash

declare -i ln=1     ## line count
declare -i pg=1     ## page count

fname="${1:-/dev/stdin}"            ## read from file or stdin

printf "\nln:pg  text\n"            ## print header

while read -r l; do                 ## read each line
    if [ "${l:0:1}" = $'\f' ]; then ## if form-feed found (quoted so empty lines do not break the test)
        ((pg++))
        printf "<ff>\n%2s:%2s  '%s'\n" "$ln" "$pg" "${l:1}"
    else
        printf "%2s:%2s  '%s'\n" "$ln" "$pg" "$l"
    fi
    ((ln++))
done < "$fname"

Example Input File

A simple input file with embedded <form-feed> characters was created using:

$ echo -e "a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl" > dat/affex.txt

which, when output, gives:

$ cat dat/affex.txt
a
b
c

d
e
f
g
h

i
j
k
l

Example Use/Output

$ bash affex.sh <dat/affex.txt

ln:pg  text
 1: 1  'a'
 2: 1  'b'
 3: 1  'c'
<ff>
 4: 2  'd'
 5: 2  'e'
 6: 2  'f'
 7: 2  'g'
 8: 2  'h'
<ff>
 9: 3  'i'
10: 3  'j'
11: 3  'k'
12: 3  'l'
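
The same line-by-line loop can be reduced to a function that prints only the page of one requested line. This is a sketch, not the answer's own code: the name page_of_line_loop is hypothetical, and it assumes, as in the sample files, that a form feed appears as the first character of the line that starts a new page.

```shell
# Hypothetical helper: print the page that holds line $1 of file $2, or 0 if out of range.
page_of_line_loop() {
  local want="$1" file="$2" ln=0 pg=1 l ff
  ff=$(printf '\f')                 # portable way to obtain a form feed character
  while IFS= read -r l; do
    ln=$((ln + 1))
    case $l in
      "$ff"*) pg=$((pg + 1)) ;;     # a leading form feed starts a new page
    esac
    if [ "$ln" -eq "$want" ]; then
      echo "$pg"
      return 0
    fi
  done < "$file"
  echo 0                            # requested line is beyond the end of the file
}
```

A trailing form feed that is not followed by a newline is not counted as a line, because read reports end of input for it.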

Answer 3

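The idea of using read -d $'\f' and then counting the lines is good.

This version might not look elegant: if nline is less than or equal to the number of lines in the file, the file is read twice (once by wc and once by head).

Give it a try, because it is super fast: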

function get_page_from_line ()
{
    local nline="${1}"
    local input_file="${2}"    
    if [[ $(wc -l "${input_file}" | awk '{print $1}') -lt nline ]] ; then
        printf "0\n"
    else
        printf "%d\n" $(( $(head -n ${nline} "${input_file}" | grep -c "^"$'\f') + 1 ))
    fi
}

awk outperforms the bash version above; awk was created for this kind of text processing.

Try this tested version:

function get_page_from_line ()
{
  awk -v nline="${1}" '
    BEGIN {
      npag=1;
    }
    {
      if (index($0,"\f")>0) {
        npag++;
      }
      if (NR==nline) {
        print npag;
        linefound=1;
        exit;
      }
    }
    END {
      if (!linefound) {
        print 0;
      }
    }' "${2}"
}

When \f is encountered, the page number is incremented.

NR is the current line number.
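
To see how NR and the \f test interact, the same two steps can be run on every line and the pair "line page" printed. This is only an illustrative sketch; the function name list_pages is hypothetical.

```shell
# Hypothetical demo: print "<line number> <page number>" for every record of file $1.
list_pages() {
  awk 'BEGIN { npag = 1 }
       {
         if (index($0, "\f") > 0) npag++   # a form feed on this line starts a new page
         print NR, npag
       }' "$1"
}
```

Note that a trailing form feed with no newline after it is still read by awk as one more record, so it shows up as an extra final line in this listing.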

----

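For the record, here is another bash version.

This version uses only built-in commands to count the lines in the current page.

The speedtest.sh you provided in the comments showed it is slightly ahead (about 20 seconds), which makes it roughly equivalent to your version: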

function get_page_from_line ()
{
    local nline="$1"
    local input_file="$2"

    local npag=0
    local total=0

    while IFS= read -d $'\f' -r page; do
        npag=$(( npag + 1 ))
        IFS=$'\n'
        for line in ${page}
        do
            total=$(( total + 1 ))
            if [[ total -eq nline ]] ; then
                printf "%d\n" ${npag}
                unset IFS
                return
            fi
        done
        unset IFS
    done < "$input_file"
    printf "0\n"
    return
}

Answer 4

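With Awk, you can set RS (the record separator, default newline) to form feed (\f) and FS (the input field separator, default any sequence of horizontal whitespace) to newline (\n), and obtain the number of lines as the number of "fields" in a "record", which is a "page".

The placement of the form feeds in your data will produce some empty lines within a page, so the counts are off where that happens.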

awk -F '\n' -v RS='\f' '{ print NF }' file

If $NF == "", you can decrement the number by one, and perhaps pass the number of the desired page as a variable:

awk -F '\n' -v RS='\f' -v p="2" 'NR==p { print NF - ($NF == "") }' file

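To get the page number for a particular line, just feed head -n number to the script, or loop over the counts until you have accumulated the required number of lines.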

line=1
page=1
for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' file); do
    old=$line
    ((line += count))
    echo "Lines $old through $((line - 1)) are on page $page"
    ((page++))
done
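
The "feed head -n number to the script" variant can be sketched as follows: the first n lines are cut off with head, and awk counts how many form-feed-separated records they contain. The function name page_of_line_head is hypothetical. The sketch assumes the line number lies within the file; a number past the end is not detected and simply yields the page count of the whole file.

```shell
# Hypothetical helper: print the page that holds line $1 of file $2.
# Assumes 1 <= $1 <= number of lines; out-of-range values are not detected.
page_of_line_head() {
  head -n "$1" "$2" | awk -v RS='\f' 'END { print NR }'
}
```

Because head stops after the requested line, only the part of the file up to that line is processed.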

Answer 5

This GNU awk script prints the "page" for the line number given as a command-line argument:

BEGIN   { ffcount=1;
      search = ARGV[2]
      delete ARGV[2]
      if (!search ) {
        print "Please provide linenumber as argument"  
        exit(1);
      }
    }

$1 ~ search { printf( "line %s is on page %d\n", search, ffcount) }

/[\f]/ { ffcount++ }

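Use it like awk -f formfeeds.awk formfeeds.txt 05, where formfeeds.awk is the script, formfeeds.txt is the file, and '05' is the line number.

The BEGIN rule mainly deals with the command-line argument. The other rules are simple:

$1 ~ search applies when the first field matches the command-line argument stored in search
/[\f]/ applies when there is a form feed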
