Question

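I have a script that searches PDFs for certain errors and logs each instance to a file along with the corresponding page number. (It uses pdfgrep, if you're interested.) Each PDF is one section of a book, so its PDF page numbers always start from 1. For each error in the log I want to show the number that's actually printed on the page, called the folio, for ease of reference, rather than the PDF page number I currently have.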

Not every line in the log is an error. Here is some sample output:

Searches run on vol1.pdf, 01-06-2016

S01 SPACED SEMICOLON
77:  ences                                      Unit for Italian Studies: ; Dir C. KENNEDY      SUMMERS, P. M., Tropical Veterinary Science
143:BRAC Business School: ; Head Dr MD               BISWAS                                      Internet: www.diu.ac.bd
143:BRAC Development Institute: ; Dir Prof.        Dir for Student Welfare: GOUTAM KUMAR         Private control
261:Basic Institute of Biosciences: ; tel. (12)      College of Business Administration: Ir MARIA  Academic year: February to December
261:Basic Institute of Exact Sciences: ; tel.                                                      atinguetá
261:Basic Institute of Human Sciences: ; tel.        Committee on Ethics: Dr RODRIGO RICCI         Vice-Rector: MARILZA VIEIRA CUNHA RUDGE
299:Documentation sur les Traditions et les                                                        Interpreters (ASTI): ; Dir Dr ETIENNE ZÉ
328:              Political Science:            CRESPI, B. J.                         ing: ; tel. (604) 291-5240; f. 1987; Dir Dr R.

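The file in the example starts at p. 81, which is captured in the script as $folio. For every line that begins with a 2- to 4-digit number followed by a colon, I want to replace that number N with N + ($folio - 1).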
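As a quick check of that rule, here is a minimal sketch of the arithmetic in bash. The function name to_folio is made up for illustration and is not part of the script.

```shell
#!/bin/bash
# to_folio PDF_PAGE FOLIO  (hypothetical helper name)
# Prints the printed page number for a PDF page, given the folio
# printed on the first page of the PDF.
to_folio() {
    local pdf_page=$1 folio=$2
    echo $(( pdf_page + (folio - 1) ))
}

to_folio 1 81     # first PDF page carries the first folio: 81
to_folio 77 81    # 77 + 80 = 157
```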

I originally thought of going through the log line by line with a loop like this:

while IFS= read -r line
do
    # magic here
done < "$log"

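I'm very new to the command line. My first idea was to use grep ^[0-9]{2,4}, somehow save the result to a variable and then do the calculation, but from googling it seems that sed or awk might be more useful? I've found plenty of answers for incrementing a number by 1 and so on, but nothing like this, and I'm not sure how to proceed. I'd really appreciate any suggestions.

The $folio value is different every time, so I collect it from user input, along with the $log file name.

The headings (e.g. S01 SPACED SEMICOLON) need to stay unchanged.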

Answer 1

My Perl is a bit rusty, but:

perl -nle 's/^(\d{2,4}):/$1+82 . ":"/e && print' log

159:  ences                                      Unit for Italian Studies: ; Dir C. KENNEDY      SUMMERS, P. M., Tropical Veterinary Science
225:BRAC Business School: ; Head Dr MD               BISWAS                                      Internet: www.diu.ac.bd
225:BRAC Development Institute: ; Dir Prof.        Dir for Student Welfare: GOUTAM KUMAR         Private control
343:Basic Institute of Biosciences: ; tel. (12)      College of Business Administration: Ir MARIA  Academic year: February to December
343:Basic Institute of Exact Sciences: ; tel.                                                      atinguetá
343:Basic Institute of Human Sciences: ; tel.        Committee on Ethics: Dr RODRIGO RICCI         Vice-Rector: MARILZA VIEIRA CUNHA RUDGE
381:Documentation sur les Traditions et les                                                        Interpreters (ASTI): ; Dir Dr ETIENNE ZÉ
410:              Political Science:            CRESPI, B. J.

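That says: "Process the file "log". If you find a line that starts with 2-4 digits and a colon, compute a replacement. The replacement must be the number you found plus 82, followed by a colon. If you found anything like that, print the line."

This is hard to explain, but anything inside (...) on the left-hand side gets numbered and can be used on the right-hand side as $n. So the 2-4 digit number we found is available in the replacement as $1.

The thing doing the magic is the e, which means "execute some more Perl to compute the replacement string".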

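If you want the other lines (i.e. the headings and any lines that don't start with a number) to pass through intact, change the && to ;. In fact, as @123 pointed out in the comments, if that's what you want you can simply use: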

perl -pe 's/^(\d{2,4}):/$1+82 . ":"/e' log

Searches run on vol1.pdf, 01-06-2016

S01 SPACED SEMICOLON
159:  ences                                      Unit for Italian Studies: ; Dir C. KENNEDY      SUMMERS, P. M., Tropical Veterinary Science
225:BRAC Business School: ; Head Dr MD               BISWAS                                      Internet: www.diu.ac.bd
225:BRAC Development Institute: ; Dir Prof.        Dir for Student Welfare: GOUTAM KUMAR         Private control
343:Basic Institute of Biosciences: ; tel. (12)      College of Business Administration: Ir MARIA  Academic year: February to December
343:Basic Institute of Exact Sciences: ; tel.                                                      atinguetá
343:Basic Institute of Human Sciences: ; tel.        Committee on Ethics: Dr RODRIGO RICCI         Vice-Rector: MARILZA VIEIRA CUNHA RUDGE
381:Documentation sur les Traditions et les                                                        Interpreters (ASTI): ; Dir Dr ETIENNE ZÉ
410:              Political Science:            CRESPI, B. J.

Answer 2

An awk solution might look like this:

#!/bin/bash

# The awk script below relies on features of POSIX awk that are not present
# in legacy awk and are not enabled by default in some other awks (e.g.
# older GNU awk).  POSIX_AWK identifies a POSIX-compliant awk to use.
POSIX_AWK='/usr/bin/awk --posix'

# ...

folio=7

# ...

# $POSIX_AWK is deliberately unquoted so that it splits into command + option.
$POSIX_AWK -F ':' -v offset="$((folio - 1))" '
/^[0-9]{2,4}:/  { sub(/^[0-9]+/, $1 + offset) }
                { print }
' "$log"

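The awk program is embedded in a shell script. Arithmetic expansion is used to compute the page-number offset, which is then pre-assigned to the awk variable offset via awk's -v option (bash performs this part when it expands the awk command line). The -F ':' option tells awk to use the colon as the field separator; this is a convenience for extracting the leading number from the numbered lines. The program reads each line of the file designated by $log, substitutes the adjusted number for the original one on numbered lines, and in every case prints the possibly modified line to standard output.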
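If an awk that supports interval expressions such as {2,4} is not available, the same adjustment can be sketched with match() and RLENGTH, which should work in any POSIX-style awk. This is a swapped-in variant rather than the script above, and the function name shift_folios_awk is made up for the example.

```shell
#!/bin/bash
# shift_folios_awk FOLIO LOGFILE  (hypothetical helper name)
shift_folios_awk() {
    local folio=$1 log=$2
    awk -v offset="$((folio - 1))" '
    # match() sets RLENGTH to the length of the leading "digits:" prefix,
    # so 2 to 4 digits plus the colon means RLENGTH between 3 and 5.
    match($0, /^[0-9]+:/) && RLENGTH >= 3 && RLENGTH <= 5 {
        # Rebuild the line: adjusted number, then everything from the colon on.
        $0 = (substr($0, 1, RLENGTH - 1) + offset) substr($0, RLENGTH)
    }
    { print }
    ' "$log"
}

# Usage:
#   shift_folios_awk "$folio" "$log"
```

Assigning to $0 as a whole, rather than to $1, leaves the spacing in the rest of the line untouched.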

Increment numbers in a file that match a given pattern by a fixed amount
