I'm trying to extract text from a PDF embedded in a web page. I tried the pdf-reader gem, but I get a parse error:
`find_first_xref_offset': PDF does not contain EOF marker (PDF::Reader::MalformedPDFError)
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:99:in `load_offsets'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:60:in `initialize'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `new'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `initialize'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `new'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `initialize'
from role.rb:5:in `new'
from role.rb:5:in `<main>'
Does anyone know how to fix this? Is there a better gem for this purpose?
Thanks.
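A note on the error itself: pdf-reader raises MalformedPDFError here because it could not find the %%EOF marker that closes a PDF file. When the PDF is embedded in a web page, one possible cause (an assumption, since role.rb is not shown) is that the downloaded bytes are the surrounding HTML page or a truncated download rather than the PDF itself. A minimal sketch of a sanity check to run before handing the data to the parser; looks_like_pdf? is a hypothetical helper, not part of pdf-reader:

```ruby
# Hypothetical helper (not part of pdf-reader): a rough check that a
# downloaded string is a complete PDF rather than HTML or a partial file.
def looks_like_pdf?(data)
  bytes = data.to_s.b                        # compare raw bytes, ignoring encoding
  # The "%PDF-" header should appear near the start of the file.
  has_header = bytes[0, 1024].include?("%PDF-")
  # The "%%EOF" marker should appear near the end. Use the last 1 KB,
  # or the whole string when it is shorter than that.
  tail = bytes[-1024, 1024] || bytes
  has_trailer = tail.include?("%%EOF")
  has_header && has_trailer
end
```

If this returns false for the downloaded data, it is worth inspecting the first few bytes to see whether they are HTML and, if so, locating the direct URL of the PDF instead.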
Answer 0 (score: 0)
I found this while Googling your problem. Perhaps it offers an approach that solves it?
#################################################################
# Extract text from a PDF file
# This scraper takes about 2 minutes to run and no output
# appears until the end.
#################################################################
# This scraper uses the pdf-reader gem.
# Documentation is at https://github.com/yob/pdf-reader#readme
# If you have problems you can ask for help at http://groups.google.com/group/pdf-reader
require 'pdf-reader'
require 'open-uri'

########## This section contains the callback code that processes the PDF file contents ######
class PageTextReceiver
  attr_accessor :content, :page_counter

  def initialize
    @content = []
    @page_counter = 0
  end

  # Called when page parsing starts
  def begin_page(arg = nil)
    @page_counter += 1
    @content << ""
  end

  # Record text that is drawn on the page
  def show_text(string, *params)
    @content.last << string
  end

  # There are a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text

  # This final text callback takes slightly different arguments:
  # an array that mixes strings with non-string positioning values
  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end
################ End of TextReceiver #############################

# If you don't have two minutes to wait you might prefer this
# smaller pdf
# pdf = open('http://www.hmrc.gov.uk/factsheets/import-export.pdf')
# pdf = open('http://www.madingley.org/uploaded/Hansard_08.07.2010.pdf')
# Note: on Ruby 3.0+ Kernel#open no longer accepts URLs; use URI.open there.
pdf = open('http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf')

####### Instantiate the receiver and the reader
# Note: the no-argument PDF::Reader.new together with #parse is the old
# callback API; it is deprecated in pdf-reader 1.x and removed in 2.0.
receiver = PageTextReceiver.new
pdf_reader = PDF::Reader.new

####### Now you just need to make the call to parse...
pdf_reader.parse(pdf, receiver)

####### ...and do whatever you want with the text.
####### This just outputs it.
receiver.content.each { |r| puts r.strip }
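The sample above relies on the callback-style parse call. For comparison, here is a minimal sketch using the page-oriented interface described in the pdf-reader README (PDF::Reader.new(io), reader.pages, page.text), which needs no receiver class. The method name extract_page_texts and the choice to return an empty array on a malformed file are assumptions made for illustration, not something taken from the question or the gem:

```ruby
# Return an array with one text string per page of the PDF read from `io`.
# `extract_page_texts` is a hypothetical name chosen for this sketch.
# Assumes the pdf-reader gem (1.x or later) is installed; the require sits
# inside the method so that merely loading this file does not need the gem.
def extract_page_texts(io)
  require 'pdf-reader'
  begin
    reader = PDF::Reader.new(io)            # accepts an IO-like object or a filename
    reader.pages.map { |page| page.text }   # plain text of each page, in order
  rescue PDF::Reader::MalformedPDFError => e
    # This is the error class from the question, e.g. when the data is not a PDF.
    warn "Could not parse PDF: #{e.message}"
    []
  end
end

# Usage, with the URL from the sample above (URI.open needs Ruby 2.5+;
# on older Rubies use open(url) after requiring open-uri):
#   require 'open-uri'
#   URI.open('http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf', 'rb') do |io|
#     extract_page_texts(io).each { |text| puts text.strip }
#   end
```

This does not by itself cure a missing EOF marker: if the bytes passed in are not a complete PDF, the same MalformedPDFError is raised and the sketch simply reports it.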