I have a string:
WP(PIL)/7/2013 PUBLIC AND PANCHAYAT MS PEMA BHUTIA MR. S.K. CHETTRI,\n KABI LUNGCHUK MS PANILA THEENGH ASST. GOVT.\n CONSTITUENCY, NORTH MS MON MAYA SUBBA ADVOCATE\n SIKKIM MS TASHI DOMA SHERPA MR. KARMA THINLAY,\n Vs MR SANGAY GURMEY CENTRAL GOVT.\n THE SECRETARY, MINISTRY BHUTIA COUNSEL\n OF SURFACE TRANSPORT MR. JORGAY NAMKA MR THINLAY DORJEE\n AND ORS. MR. ZANGPO SHERPA, BHUTIA\n AMICUS CURIAE MS POLLIN RAI, ASST.\n GOVT. ADVOCATE\n
I split it on the '\n' character, which gives:
[" WP(PIL)/7/2013 PUBLIC AND PANCHAYAT MS PEMA BHUTIA MR. S.K. CHETTRI,",
" KABI LUNGCHUK MS PANILA THEENGH ASST. GOVT.",
" CONSTITUENCY, NORTH MS MON MAYA SUBBA ADVOCATE",
" SIKKIM MS TASHI DOMA SHERPA MR. KARMA THINLAY,",
" Vs MR SANGAY GURMEY CENTRAL GOVT.",
" THE SECRETARY, MINISTRY BHUTIA COUNSEL",
" OF SURFACE TRANSPORT MR. JORGAY NAMKA MR THINLAY DORJEE",
" AND ORS. MR. ZANGPO SHERPA, BHUTIA",
" AMICUS CURIAE MS POLLIN RAI, ASST.",
" GOVT. ADVOCATE"]
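I want to extract 4 columns from each line (that is, convert the array of strings into a matrix). Each extracted string must also land in the column it belongs to. For example, 'GOVT. ADVOCATE' in the last string should be extracted as ['', '', '', 'GOVT. ADVOCATE'].
I am using the docsplit library to parse a PDF that contains tabular data. The problem is that each row in the PDF holds an inner table, which looks like the array of strings shown above.
I tried taking the index of the first character of each column as a reference and slicing the strings with those values, but could not arrive at a working solution.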
Answer 0 (score: 1)
As per my comment above, here is my solution:
require 'pp'
test_array = [" WP(PIL)/7/2013 PUBLIC AND PANCHAYAT MS PEMA BHUTIA MR. S.K. CHETTRI,",
" KABI LUNGCHUK MS PANILA THEENGH ASST. GOVT.",
" CONSTITUENCY, NORTH MS MON MAYA SUBBA ADVOCATE",
" SIKKIM MS TASHI DOMA SHERPA MR. KARMA THINLAY,",
" Vs MR SANGAY GURMEY CENTRAL GOVT.",
" THE SECRETARY, MINISTRY BHUTIA COUNSEL",
" OF SURFACE TRANSPORT MR. JORGAY NAMKA MR THINLAY DORJEE",
" AND ORS. MR. ZANGPO SHERPA, BHUTIA",
" AMICUS CURIAE MS POLLIN RAI, ASST.",
" GOVT. ADVOCATE"]
class ColumnAnalyzer
  attr_reader :columns
  attr_accessor :array

  def initialize(array)
    @array = array
    analyze
  end

  # Work out the left and right edge of every column across all lines.
  def analyze
    lefts = Array.new
    rights = Array.new
    @array.each do |line|
      # Left edges: a non-space character preceded by two whitespace characters.
      pos_left = Array.new
      deconstruct = line.dup
      col = 0
      while m = deconstruct.match(/\s\s[^\s]{1}/) do
        left = m.offset(0)[0]+1
        pos_left[col] = col == 0 ? left : left + pos_left[col-1]
        col += 1
        deconstruct = deconstruct[left+1..-1]
      end
      lefts.push pos_left

      # Right edges: a non-space character followed by two whitespace characters.
      pos_right = Array.new
      deconstruct = line.dup
      col = 0
      while m = deconstruct.match(/[^\s]{1}\s\s/) do
        right = m.offset(0)[0]
        pos_right[col] = col == 0 ? right : right + pos_right[col-1]
        col += 1
        deconstruct = deconstruct[right+1..-1]
      end
      pos_right.push line.length
      rights.push pos_right
    end

    cols_l = lefts.collect { |a| a.size }.max
    cols_r = rights.collect { |a| a.size }.max
    cols = [cols_l,cols_r].max # no. of columns

    @columns = Array.new
    (0..cols-1).each do |col|
      @columns[col] = Hash.new
      @columns[col][:l] = lefts.map { |a| a[col] }.min
      # Pad lines that have fewer columns so their edges shift to the right.
      lefts.select { |a| a.size < cols }.map! { |a| a.unshift 0 }
      rights.select { |a| a.size < cols }.map! { |a| a.unshift 0 }
    end
    (0..cols-1).each do |col|
      @columns[col][:r] = rights.map { |a| a[col] }.max
    end
  end

  # Slice every line at the detected column edges.
  def extract
    data = Array.new
    @array.each do |line|
      line_array = Array.new
      @columns.each do |col|
        # strip (not strip!) because strip! returns nil when nothing is removed;
        # to_s covers lines shorter than the column's left edge.
        line_array.push line[col[:l]..col[:r]].to_s.strip
      end
      data.push line_array
    end
    data
  end
end
ca = ColumnAnalyzer.new(test_array)
data = ca.extract
pp ca.columns
pp data
=> [{:l=>7, :r=>21}, {:l=>28, :r=>54}, {:l=>62, :r=>85}, {:l=>87, :r=>113}]
[["WP(PIL)/7/2013",
"PUBLIC AND PANCHAYAT",
"MS PEMA BHUTIA",
"MR. S.K. CHETTRI,"],
["", "KABI LUNGCHUK", "MS PANILA THEENGH", "ASST. GOVT."],
["", "CONSTITUENCY, NORTH", "MS MON MAYA SUBBA", "ADVOCATE"],
["", "SIKKIM", "MS TASHI DOMA SHERP", "MR. KARMA THINLAY,"],
["", "Vs", "MR SANGAY GURMEY", "CENTRAL GOVT."],
["", "THE SECRETARY, MINISTRY", "BHUTIA", "COUNSEL"],
["", "OF SURFACE TRANSPORT", "MR. JORGAY NAMKA", "MR THINLAY DORJEE"],
["", "AND ORS.", "MR. ZANGPO SHERPA,", "BHUTIA"],
["", "", "AMICUS CURIAE", "MS POLLIN RAI, ASST."],
["", "", "", "GOVT. ADVOCATE"]]
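A different way to locate the column edges is to look for "gutters", that is, character positions that are blank in every line. The following is only a minimal sketch of that idea, not the code from this answer: the method name `column_ranges`, the `min_gap` parameter and the sample rows are all hypothetical, and it assumes the lines still carry their original padding and that columns are separated by at least two blank positions.

```ruby
# Hypothetical sample data, padded into aligned columns with format.
rows = [
  format("%-6s%-10s%s", "A1", "B one", "C1"),
  format("%-6s%-10s%s", "",   "B two", "C2"),
  format("%-6s%-10s%s", "",   "",      "C3")
]

# Return [first, last] index pairs for each run of positions that hold a
# non-blank character in at least one row. Runs of fewer than min_gap
# blank positions (single spaces between words) do not end a column.
def column_ranges(rows, min_gap: 2)
  width = rows.map(&:length).max
  used = Array.new(width) { |i| rows.any? { |r| i < r.length && r[i] != " " } }
  ranges = []
  start = nil
  gap = 0
  used.each_with_index do |flag, i|
    if flag
      start ||= i
      gap = 0
    elsif start
      gap += 1
      if gap >= min_gap
        ranges << [start, i - gap] # last non-blank position of this column
        start = nil
      end
    end
  end
  ranges << [start, width - 1 - gap] if start
  ranges
end

ranges = column_ranges(rows)
matrix = rows.map { |r| ranges.map { |a, b| r[a..b].to_s.strip } }
# ranges => [[0, 1], [6, 10], [16, 17]]
# matrix => [["A1", "B one", "C1"], ["", "B two", "C2"], ["", "", "C3"]]
```

Because the edges come from all lines at once, a line that only fills the last column still ends up with empty strings in the earlier ones.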
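Answer 1 (score: 0)
I solved the problem above. It is not a perfect solution, but it works in most cases.
I assume that the first string in the array is the longest one (it contains data for all the columns).
Docsplit is not really relevant to the problem at hand. In any case, add this to the Gemfile: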
gem 'docsplit', git: 'git@github.com:prasadsurase/docsplit.git', branch: 'layout-nopgbrk-support'
Run the following code in the console to get the text from the PDF:
Docsplit.extract_text(pdf_file_path, { layout: true, nopgbrk: true, output: "#{Rails.root}/tmp/pdf_to_text/" })
Assuming there are only 4 logical data columns:
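# arr is the array of lines produced by splitting the extracted text on "\n".
# Column start offsets are taken from the first line only.
indices = arr.first.scan(/\s{2,}\S{2,3}\s{1}*/).map{|substr| arr.first.index(substr.strip) }
count = indices.count
actual_data = arr.map do |str|
  record = []
  count.times do |i|
    # The last column runs to the end of the line; the others stop before the next start.
    record << [count - 1 == i ? str[indices[i]..-1] : str[indices[i]..indices[i+1] - 1]]
  end
  record
end
# Array#second, #third and #fourth come from ActiveSupport (available in Rails).
details = [:first, :second, :third, :fourth].map do |indx|
  actual_data.map(&indx).join('').strip.gsub(/\s+/, ' ')
end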
details is an array containing 4 strings, one per column.
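The same idea (take the column start offsets from the first line, slice every line at those offsets, then merge each column's pieces) can be sketched in plain Ruby without ActiveSupport. This is a hedged illustration rather than the code above: `column_starts`, `slice_columns` and the sample rows are hypothetical names and data, and it assumes the first line has a value in every column and that words inside a cell are separated by single spaces.

```ruby
# Hypothetical sample data, padded into aligned columns with format.
rows = [
  format("%-10s%-16s%s", "CASE/1", "FIRST PARTY", "MR. A,"),
  format("%-10s%-16s%s", "",       "SECOND LINE", "ADVOCATE"),
  format("%-10s%-16s%s", "",       "",            "COUNSEL")
]

# Start offset of every cell in the header line. A cell is a run of
# words separated by single spaces; two or more spaces end the cell.
def column_starts(header)
  starts = []
  header.scan(/\S+(?: \S+)*/) { starts << Regexp.last_match.begin(0) }
  starts
end

# Cut each row at the given start offsets; the last column runs to the
# end of the row. Rows shorter than an offset yield an empty string.
def slice_columns(rows, starts)
  rows.map do |row|
    starts.each_with_index.map do |from, i|
      to = i + 1 < starts.size ? starts[i + 1] - 1 : -1
      row[from..to].to_s.strip
    end
  end
end

starts  = column_starts(rows.first)
matrix  = slice_columns(rows, starts)
# Merge the non-empty pieces of each column into one string.
details = matrix.transpose.map { |cells| cells.reject(&:empty?).join(" ") }
# starts  => [0, 10, 26]
# details => ["CASE/1", "FIRST PARTY SECOND LINE", "MR. A, ADVOCATE COUNSEL"]
```

Using `transpose` replaces the `:first` ... `:fourth` lookups, so the sketch works for any number of columns found in the first line.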