Question

I'm using pdfgrep to search for a name across multiple PDFs stored in a directory, and storing the results in a file:

pdfgrep -R 'My string' > ../output-file

It prints the following output:

./file1.pdf:     91   My String                               Just_another_string                   75              53            49            30              57               48                74             69
./file2.pdf:     8    My String                                Just_another_string                                                              40
./file3.pdf:     92 My String                                  Just_another_string                   64              62            76             50           76            88             80             148

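Every line of the output has a lot of unnecessary spaces between the columns. I'd like to reformat the output so that each run of spaces is reduced to a single space between columns.

Is there any way to do this? Thanks in advance.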

Answer 1

The quick and dirty way: use awk. Assuming the format is always as shown (and assuming your original command is correct):

pdfgrep -R 'My string' | awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9}' > ../output-file

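Edit based on the comments:

@Inian's answer is better (since it handles any number of columns), but in short, what mine does is tell awk to split the input on whitespace and then print the fields back out with a single space between each one. For example, you can skip the first column by leaving out $1, or swap the third and fourth columns by printing $4 before $3.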
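The same split-and-rejoin idea can be sketched in Python, which makes the "any number of columns" case and the skip/swap variations explicit. This is only an illustration: `select_fields` is a hypothetical helper name, and the sample line is modelled on the output shown in the question.

```python
def select_fields(line, indices):
    """Split on runs of whitespace and rejoin the chosen fields with single spaces.

    Indices are 1-based to mirror awk's $1, $2, ...; indices beyond the end
    of a short line are silently skipped.
    """
    fields = line.split()
    return " ".join(fields[i - 1] for i in indices if 1 <= i <= len(fields))


# Sample line modelled on the question's output (assumed layout).
sample = "./file2.pdf:     8    My String      Just_another_string        40"

normalized = " ".join(sample.split())                # every column, single-spaced
without_first = select_fields(sample, range(2, 7))   # like leaving out $1
swapped = select_fields(sample, [1, 2, 4, 3, 5, 6])  # like printing $4 before $3

print(normalized)
print(without_first)
print(swapped)
```

Because `str.split()` with no arguments splits on any run of whitespace, the first form works no matter how many columns a line has.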

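As for efficiency, if you want to push this into a database, you'd probably want to use Python (or Perl or PHP, but a quick look at my profile should show my preference) to do the actual SQL import. 500 PDFs doesn't really worry me... I'd expect you can get away with: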

pdfgrep -R 'My string' > ../output-file

Then run a Python program along the lines of:

with open("output-file", "rt") as f:
    for line in f:
        data = line.split()  # list of fields, split on any run of whitespace
        cleanline = " ".join(data)  # fields rejoined with a single space between each
        print(cleanline)
        # or you could just stick data directly into the database; details omitted because there are way too many variables here
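For the database step, here is a minimal sketch using Python's built-in sqlite3 module. Everything specific in it is an assumption made for illustration: SQLite as the database, the `matches` table and its two columns, and the idea that the first whitespace-separated field is the file name followed by a colon (which breaks for file names containing spaces). Adapt it to whatever schema you actually have.

```python
import sqlite3


def parse_line(line):
    """Split one pdfgrep output line into (filename, single-spaced remainder)."""
    fields = line.split()
    filename = fields[0].rstrip(":")  # "./file1.pdf:" -> "./file1.pdf"
    return filename, " ".join(fields[1:])


def load_lines(conn, lines):
    """Insert cleaned lines into a hypothetical 'matches' table; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS matches (filename TEXT, cleaned TEXT)")
    rows = [parse_line(line) for line in lines if line.strip()]  # skip blank lines
    conn.executemany("INSERT INTO matches VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# In-memory database and a sample line, so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
sample = [
    "./file2.pdf:     8    My String      Just_another_string      40\n",
    "\n",
]
inserted = load_lines(conn, sample)
stored = conn.execute("SELECT filename, cleaned FROM matches").fetchall()
print(inserted, stored)
```

To use it on the real data, pass the open file object from the loop above as `lines` and connect to a file path instead of `":memory:"`.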
