Question

With the docsplit gem, I can extract text from a PDF or any other file type. For example, with the following line:

 Docsplit.extract_pages('doc.pdf')

I can get the text content of a PDF file.

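I'm currently working with Rails; the PDF is sent through a request and lives in memory. Looking at the API and the source code, I couldn't find a way to extract text from memory, only from a file.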

Is there a way to get this PDF's text without creating a temporary file?

In case it matters, I'm using attachment_fu.

Answer 1

Use a temporary directory:

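require 'docsplit'
require 'tmpdir' # Dir.tmpdir is defined by the tmpdir standard library

def pdf_to_text(pdf_filename)
  # Write the extracted text as <basename>.txt into the system temp directory
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)

  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)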

  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)

  extracted_text
end

pdf_to_text('doc.pdf')
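
The function above expects a file name, while the question starts from PDF bytes held in memory. One way to bridge the two is Ruby's standard Tempfile. This is only a minimal sketch: it assumes the uploaded bytes are already in a string, and `with_temp_pdf` is a hypothetical helper name, not part of docsplit.

```ruby
require 'tempfile'

# Hypothetical helper (not part of docsplit): writes in-memory bytes to a
# temporary .pdf file, yields its path, and removes the file when the block ends.
def with_temp_pdf(pdf_bytes)
  Tempfile.create(['upload', '.pdf']) do |file|
    file.binmode          # PDFs are binary; avoid any newline/encoding translation
    file.write(pdf_bytes)
    file.flush            # make sure the bytes are on disk before handing out the path
    yield file.path
  end
end
```

It could then be combined with the function above as `with_temp_pdf(bytes) { |path| pdf_to_text(path) }`. Note that this still creates a temporary file, which the question hoped to avoid; it only keeps the file short-lived.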

Answer 2

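If you have the content in a string, use StringIO to create a file-like object that IO can read. To StringIO it doesn't matter whether the content is real text or binary; it's all the same.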

Have a look at either of these:

new(string=""[, mode])
Creates a new StringIO instance from the given string and mode.

open(string=""[, mode]) {|strio| ...}
Equivalent to ::new except that when it is called with a block, it yields with the new instance and closes it, and returns the result which returned from the block.
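
A small sketch of both forms, using a stand-in byte string rather than a real PDF (`pdf_bytes` is an illustrative name). Whether this helps with docsplit depends on docsplit accepting an IO object; going by the question, its API only takes file paths, so treat this as an illustration of StringIO itself.

```ruby
require 'stringio'

# Stand-in for bytes received in a request; not a real PDF.
pdf_bytes = "%PDF-1.4\nfake body".b

# StringIO.new wraps the string in a file-like object.
io = StringIO.new(pdf_bytes)
header = io.read(8) # read the first 8 bytes, as you would from a File
io.rewind           # go back to the start for the next reader

# StringIO.open with a block yields the instance, closes it afterwards,
# and returns the block's result.
size = StringIO.open(pdf_bytes) { |strio| strio.read.bytesize }
```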

Extracting text from an in-memory document with docsplit
