Question

Suppose I have been given some journal papers in PDF format. I want to find each paper's title and list of authors. How can I do that in a shell script?

Answer 1

I don't know whether this will work for your journals, but it works for some PDF files:

strings "myjournal.pdf" | grep -E "/Author|/Title" | tr '/' '\n' | grep -E "Author|Title"
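If you want the result in a tidier form, something along these lines might do. It is only a sketch: the function name pdf_title_author is made up, and it assumes the metadata sits in the file as uncompressed literal strings such as /Title (Some paper). PDFs that keep their metadata in compressed streams, in XMP, or as hex or UTF-16 strings will print nothing, and a title containing an escaped parenthesis will be cut short.

```shell
#!/bin/sh
# Hypothetical helper: print the /Title and /Author entries found in a PDF.
# Assumption: the entries are stored as plain literal strings, e.g. /Title (Some paper).
pdf_title_author() {
    # LC_ALL=C makes grep treat the file as raw bytes; -a forces text mode
    # on a binary file; -o prints only the matching part of each line.
    LC_ALL=C grep -a -oE '/(Title|Author) ?\([^)]*\)' "$1" |
        # Turn "/Title (Some paper)" into "Title: Some paper".
        sed -E 's#^/(Title|Author) ?\((.*)\)$#\1: \2#'
}
```

You would call it as pdf_title_author "myjournal.pdf" and get one "Title: ..." or "Author: ..." line per entry found.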

Answer 2

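I worked on a project where we had to search within the contents of PDF files. The process we settled on was as follows:

First, we convert the PDF file to an image with the following command: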

convert -density 500 "pdf_path.pdf" -depth 8 "image_output.png"

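Once the image has been created, we use the following command to produce a text file with the PDF's contents (tesseract appends .txt to the output name):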

tesseract "image_output.png" "out_put_txt_file_name" -l por

You will probably have to change the -l por argument, since we were working with Portuguese text.
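Since the title and authors are normally on the first page, the two steps can be chained for that page alone. The following is a rough sketch rather than what we actually ran: the function name pdf_first_page_text and the default language eng are my own choices, and it assumes ImageMagick's convert and tesseract (with the matching language data) are installed. The "[0]" suffix asks ImageMagick to read only the first page of the PDF.

```shell
#!/bin/sh
# Hypothetical helper: OCR the first page of a PDF and print the text.
# Usage: pdf_first_page_text FILE.pdf [LANG]   (LANG defaults to eng)
pdf_first_page_text() {
    pdf=$1
    lang=${2:-eng}
    workdir=$(mktemp -d) || return 1

    # Render page 0 only, then OCR it; tesseract writes "$workdir/page.txt".
    convert -density 500 "${pdf}[0]" -depth 8 "$workdir/page.png" &&
        tesseract "$workdir/page.png" "$workdir/page" -l "$lang" >/dev/null 2>&1 &&
        cat "$workdir/page.txt"
    status=$?

    # Clean up the temporary image and text file in every case.
    rm -rf "$workdir"
    return "$status"
}
```

For example, pdf_first_page_text "pdf_path.pdf" por | head -n 10 shows the first recognized lines, which is usually where the title and author names end up. Picking them out of that text still has to be done by eye or with rules specific to the journal's layout.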