Splitting a string of words while preserving the order of the word columns

Time: 2018-05-30 11:33:07

Tags: ruby regex string

I have a string:

        WP(PIL)/7/2013        PUBLIC AND PANCHAYAT               MS PEMA BHUTIA            MR. S.K. CHETTRI,\n                                KABI LUNGCHUK                      MS PANILA THEENGH         ASST. GOVT.\n                                CONSTITUENCY, NORTH                MS MON MAYA SUBBA         ADVOCATE\n                                SIKKIM                             MS TASHI DOMA SHERPA      MR. KARMA THINLAY,\n                                Vs                                 MR SANGAY GURMEY          CENTRAL GOVT.\n                                THE SECRETARY, MINISTRY            BHUTIA                    COUNSEL\n                                OF SURFACE TRANSPORT               MR. JORGAY NAMKA          MR THINLAY DORJEE\n                                AND ORS.                           MR. ZANGPO SHERPA,        BHUTIA\n                                                                   AMICUS CURIAE             MS POLLIN RAI, ASST.\n                                                                                             GOVT. ADVOCATE\n

I split it on the '\n' character, which gives:

["        WP(PIL)/7/2013        PUBLIC AND PANCHAYAT               MS PEMA BHUTIA            MR. S.K. CHETTRI,",
"                                KABI LUNGCHUK                      MS PANILA THEENGH         ASST. GOVT.",
"                                CONSTITUENCY, NORTH                MS MON MAYA SUBBA         ADVOCATE",
"                                SIKKIM                             MS TASHI DOMA SHERPA      MR. KARMA THINLAY,",
"                                Vs                                 MR SANGAY GURMEY          CENTRAL GOVT.",
"                                THE SECRETARY, MINISTRY            BHUTIA                    COUNSEL",
"                                OF SURFACE TRANSPORT               MR. JORGAY NAMKA          MR THINLAY DORJEE",
"                                AND ORS.                           MR. ZANGPO SHERPA,        BHUTIA",
"                                                                   AMICUS CURIAE             MS POLLIN RAI, ASST.",
"                                                                                             GOVT. ADVOCATE"]

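I want to extract 4 columns from each line (that is, convert the array of strings into a matrix). Each extracted string must also end up in the column it belongs to. For example, 'GOVT. ADVOCATE' in the last string should be extracted as ['', '', '', 'GOVT. ADVOCATE'].

I am using the docsplit library to parse a PDF that contains tabular data. The problem is that every row in the PDF holds an inner table, which looks like the array of strings shown above.

I tried taking the index of the first character of each column as a reference and using those values to slice the strings, but I could not arrive at a working solution.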
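A minimal sketch of that offset-based idea, assuming the start offset of every column is already known. The method name slice_by_offsets, the sample lines, and the offsets are hypothetical and chosen only for illustration; they are not taken from the PDF data above.

```ruby
# Slice one line into cells, given the start offset of each column.
# `starts` is an assumed, hand-picked list of offsets (hypothetical).
def slice_by_offsets(line, starts)
  starts.each_with_index.map do |from, i|
    # A cell ends where the next column starts; the last cell runs to the end.
    to = i + 1 < starts.size ? starts[i + 1] : line.length
    # String#[] returns nil when `from` lies past the end of the line.
    (line[from...to] || "").strip
  end
end

# Hypothetical sample with columns starting at offsets 0, 6 and 16.
lines = [
  "ID".ljust(6) + "NAME".ljust(10) + "ROLE",
  "".ljust(6) + "SMITH".ljust(10) + "ADMIN",
  "".ljust(16) + "GUEST"
]
starts = [0, 6, 16]

rows = lines.map { |line| slice_by_offsets(line, starts) }
rows.each { |row| p row }
```

Cells that are blank on a given line come back as empty strings, so every row keeps the same number of columns. The hard part, which this sketch leaves open, is finding reliable offsets when the columns are not perfectly aligned.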

2 Answers:

Answer 0: (score: 1)

As per my comment above, here is my solution:

require 'pp'

test_array = ["        WP(PIL)/7/2013        PUBLIC AND PANCHAYAT               MS PEMA BHUTIA            MR. S.K. CHETTRI,",
"                                KABI LUNGCHUK                      MS PANILA THEENGH         ASST. GOVT.",
"                                CONSTITUENCY, NORTH                MS MON MAYA SUBBA         ADVOCATE",
"                                SIKKIM                             MS TASHI DOMA SHERPA      MR. KARMA THINLAY,",
"                                Vs                                 MR SANGAY GURMEY          CENTRAL GOVT.",
"                                THE SECRETARY, MINISTRY            BHUTIA                    COUNSEL",
"                                OF SURFACE TRANSPORT               MR. JORGAY NAMKA          MR THINLAY DORJEE",
"                                AND ORS.                           MR. ZANGPO SHERPA,        BHUTIA",
"                                                                   AMICUS CURIAE             MS POLLIN RAI, ASST.",
"                                                                                             GOVT. ADVOCATE"]

class ColumnAnalyzer

  attr_reader :columns
  attr_accessor :array

  def initialize(array)
    @array = array
    analyze
  end

  def analyze
    lefts = Array.new
    rights = Array.new
    @array.each do |line|
      pos_left =  Array.new
      deconstruct = line.dup
      col = 0
      while m = deconstruct.match(/\s\s[^\s]{1}/) do
        left = m.offset(0)[0]+1
        pos_left[col] = col == 0 ? left : left + pos_left[col-1]
        col += 1
        deconstruct = deconstruct[left+1..-1]
      end
      lefts.push pos_left
      pos_right = Array.new
      deconstruct = line.dup
      col = 0
      while m = deconstruct.match(/[^\s]{1}\s\s/) do
        right = m.offset(0)[0]
        pos_right[col] = col == 0 ? right : right + pos_right[col-1]
        col += 1
        deconstruct = deconstruct[right+1..-1]
      end
      pos_right.push line.length
      rights.push pos_right
    end
    cols_l = lefts.collect { |a| a.size }.max 
    cols_r = rights.collect { |a| a.size }.max
    cols = [cols_l,cols_r].max # no. of columns
    @columns = Array.new
    (0..cols-1).each do |col|
      @columns[col] = Hash.new
      @columns[col][:l] = lefts.map { |a| a[col] }.min
      lefts.select { |a| a.size < cols }.map! { |a| a.unshift 0 }
      rights.select { |a| a.size < cols }.map! { |a| a.unshift 0 }
    end
    (0..cols-1).each do |col|
      @columns[col][:r]  = rights.map { |a| a[col] }.max
    end
  end

  def extract
    data = Array.new
    @array.each do |line|
      line_array = Array.new
      @columns.each do |col|
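        # strip! returns nil when nothing is stripped, and slicing past the
        # end of a short line returns nil, so use to_s and the non-bang strip.
        line_array.push line[col[:l]..col[:r]].to_s.strip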
      end
      data.push line_array
    end
    data
  end

end

ca = ColumnAnalyzer.new(test_array)
data = ca.extract
pp ca.columns
pp data

=> [{:l=>7, :r=>21}, {:l=>28, :r=>54}, {:l=>62, :r=>85}, {:l=>87, :r=>113}]
[["WP(PIL)/7/2013",
  "PUBLIC AND PANCHAYAT",
  "MS PEMA BHUTIA",
  "MR. S.K. CHETTRI,"],
 ["", "KABI LUNGCHUK", "MS PANILA THEENGH", "ASST. GOVT."],
 ["", "CONSTITUENCY, NORTH", "MS MON MAYA SUBBA", "ADVOCATE"],
 ["", "SIKKIM", "MS TASHI DOMA SHERP", "MR. KARMA THINLAY,"],
 ["", "Vs", "MR SANGAY GURMEY", "CENTRAL GOVT."],
 ["", "THE SECRETARY, MINISTRY", "BHUTIA", "COUNSEL"],
 ["", "OF SURFACE TRANSPORT", "MR. JORGAY NAMKA", "MR THINLAY DORJEE"],
 ["", "AND ORS.", "MR. ZANGPO SHERPA,", "BHUTIA"],
 ["", "", "AMICUS CURIAE", "MS POLLIN RAI, ASST."],
 ["", "", "", "GOVT. ADVOCATE"]]
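The left-edge detection in the analyze method relies on the pattern "two whitespace characters followed by a non-whitespace character". The sketch below shows the same idea in a single pass over one line. It is an illustrative variant, not the code from the answer: the method name left_edges and the sample lines are hypothetical, and it additionally treats a non-whitespace character at the very start of the line as an edge.

```ruby
# Return the offsets at which a column appears to start in `line`.
# A column start is a non-whitespace character that is either the first
# character of the line or preceded by at least two whitespace characters.
def left_edges(line)
  edges = []
  line.scan(/\A\S|(?<=\s\s)\S/) do
    # Inside a scan block, Regexp.last_match describes the current match.
    edges << Regexp.last_match.begin(0)
  end
  edges
end

# Hypothetical sample lines built with ljust so the offsets are explicit.
header = "ID".ljust(6) + "NAME".ljust(10) + "ROLE"
names  = "MS PEMA".ljust(12) + "MR. X"

p left_edges(header)
p left_edges(names)
```

Single spaces inside a cell, such as the one in "MS PEMA", are not treated as column breaks, which is why the pattern requires two whitespace characters. A line that begins with exactly one space before its first word would not have that word reported, so this sketch assumes gutters are at least two characters wide.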

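Answer 1: (score: 0)

I solved the problem above. It is not a perfect solution, but it works for most cases.

I assume that the first string in the array is the longest one (it contains data for all columns).

Docsplit is not really relevant to this question, but for completeness, add this to the Gemfile: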

gem 'docsplit', git: 'git@github.com:prasadsurase/docsplit.git', branch: 'layout-nopgbrk-support'

Run the following code in the console to extract the text from the PDF:

Docsplit.extract_text(pdf_file_path, { layout: true, nopgbrk: true, output: "#{Rails.root}/tmp/pdf_to_text/" })

Assuming there are only 4 logical data columns:

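# arr is assumed to be the array of lines obtained by splitting the text on "\n".
# Find the start index of each column from the first (longest) line.
indices = arr.first.scan(/\s{2,}\S{2,3}\s*/).map{|substr| arr.first.index(substr.strip) }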
count = indices.count
actual_data = arr.map do |str|
  record = []
  count.times do |i|
    record << [count - 1  == i ? str[indices[i]..-1] : str[indices[i]..indices[i+1] - 1]]
  end
  record
end

# Array#second, #third and #fourth come from ActiveSupport (available in Rails).
details = [:first, :second, :third, :fourth].map do |indx|
  actual_data.map(&indx).join('').strip.gsub(/\s+/, ' ')
end

details is an array containing 4 strings, one per column.
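The final joining step depends on ActiveSupport's Array#second, #third and #fourth. A plain-Ruby sketch of the same step, assuming each line has already been split into an equal number of cells, can use Array#transpose instead. The rows below are hypothetical sample data, not output from the code above.

```ruby
# Cells per line, already split into columns (hypothetical sample data).
# Array#transpose requires every row to have the same number of cells.
rows = [
  ["WP(PIL)/7/2013", "PUBLIC AND", "MS PEMA"],
  ["",               "PANCHAYAT",  "BHUTIA"],
  ["",               "",           "MR. X"]
]

# transpose turns the list of rows into a list of columns; empty cells are
# dropped before joining so that no doubled spaces appear in the result.
details = rows.transpose.map do |column|
  column.reject(&:empty?).join(" ")
end

details.each { |text| puts text }
```

This works for any number of columns, so the count of 4 does not need to be hard-coded.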