Question

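I have been working on a simple Ruby program that parses a simple PDF file and extracts the text I am interested in. I found that pdf-reader is a very good gem for parsing PDF files. I have read through the examples that ship with the gem and some of the tutorials about it.

I tried the callback approach and was able to get all the text from my PDF file. But I do not understand the concept behind the arguments of some of the callbacks.

For example, suppose my PDF has a simple table with 3 columns and 2 rows. The header row values are (Name, Address, Age) and the first row values are (Arun, Hoskote, 22). When I run the following Ruby script: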

require 'pdf/reader'

receiver = PDF::Reader::RegisterReceiver.new
reader = PDF::Reader.new("Arun.pdf")
reader.pages.each do |page|
  page.walk(receiver)
  receiver.callbacks.each do |cb|
    puts cb.inspect
  end
end

it prints a series of callbacks. Among them, the interesting show_text_with_positioning callbacks look like this:

{:name=>:show_text_with_positioning, :args=>[["N", 5, "am", -4, "e"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ad", 6, "d", 3, "ress"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Age"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ar", 4, "u", 3, "n"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["H", 3, "o", -5, "sk", 9, "o", -5, "te"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["22"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}

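In the callbacks above, what do the args represent with respect to the PDF file? If I want to extract only the name value, 'Arun' in this example (it could be anything), or the age value, i.e. '22' (again, it could be any value), how can I do that in a Ruby program? Is there a PDF parser API or Ruby API that fetches just a single value of interest from a PDF file?

How can I write a Ruby program that accesses only the particular callbacks I am interested in and gives me the text I want?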

Answer 1

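If you specifically want the text, you can do something like this (though you may want to use a different stream as the destination for the text):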

receiver = PDF::Reader::TextReceiver.new($stdout)
PDF::Reader.file("Arun.pdf", receiver)

Once you have the text, you can use regular expressions or anything else to pull out the specific values you want.
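To make that concrete, here is a minimal sketch that works directly on the callback hashes shown in the question, so it runs without a PDF at hand. As far as I understand it, each show_text_with_positioning args array mixes string fragments with numeric spacing (kerning) adjustments, so keeping only the strings and joining them recovers the drawn text. The helper names cell_texts and table_rows are made up for illustration and are not part of pdf-reader. The sketch also assumes a fixed column count and plain ASCII cell text; real PDFs may use font encodings that need decoding first.

```ruby
# Callback hashes copied from the question's output.
CALLBACKS = [
  { name: :show_text_with_positioning, args: [["N", 5, "am", -4, "e"]] },
  { name: :show_text_with_positioning, args: [[" "]] },
  { name: :show_text_with_positioning, args: [["Ad", 6, "d", 3, "ress"]] },
  { name: :show_text_with_positioning, args: [[" "]] },
  { name: :show_text_with_positioning, args: [["Age"]] },
  { name: :show_text_with_positioning, args: [[" "]] },
  { name: :show_text_with_positioning, args: [["Ar", 4, "u", 3, "n"]] },
  { name: :show_text_with_positioning, args: [[" "]] },
  { name: :show_text_with_positioning, args: [["H", 3, "o", -5, "sk", 9, "o", -5, "te"]] },
  { name: :show_text_with_positioning, args: [[" "]] },
  { name: :show_text_with_positioning, args: [["22"]] },
  { name: :show_text_with_positioning, args: [[" "]] }
].freeze

# Hypothetical helper: rebuild one string per callback.
# Numbers in the array are spacing adjustments, so only the String
# elements are kept and joined; whitespace-only results are dropped.
def cell_texts(callbacks)
  callbacks
    .select { |cb| cb[:name] == :show_text_with_positioning }
    .map { |cb| cb[:args].first.grep(String).join }
    .reject { |text| text.strip.empty? }
end

# Hypothetical helper: treat the first `columns` cells as the header row
# and turn every following group of cells into a header => value hash.
def table_rows(cells, columns)
  header, *rows = cells.each_slice(columns).to_a
  rows.map { |row| header.zip(row).to_h }
end

cells = cell_texts(CALLBACKS)
first_row = table_rows(cells, 3).first

puts first_row["Name"]
puts first_row["Age"]

# The regular-expression route from the answer, applied to the joined text:
# grab the first standalone run of one to three digits as the age.
text = cells.join(" ")
puts text[/\b(\d{1,3})\b/, 1]
```

With a real file you would pass receiver.callbacks from the question's script instead of the hard-coded CALLBACKS. If you only need plain text and are on the page-based API the question uses, page.text should, to my knowledge, give you a string you can run the same regular expression on.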
