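Ruby regular expressions and multi-line strings

Time: 2013-08-05 18:14:04

Tags: ruby regex string pdf

I'm using the prawn gem to read a 60-page, computer-generated PDF report that contains financial and demographic data for dozens of people. My challenge is that I want to capture each person's name and special ID (which sit on the same line) together with the subsequent lines that belong to that person as I scan through the text. Using Ruby's String#scan method, I've been able to capture the financials, with each match returning a row in this form: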

[<invoice no.>, <service type>, <modifier (if any)>, <service_date>, <units>, <amount>]

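I've tried to associate the ID with the financial data a few lines below it, and then switch over when the ID changes, but to no avail. Am I going about this ass-backwards? I have very little experience with regular expressions (or with programming in general).

Here is the code that works for the financial data only: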

PDF::Reader.new(file).pages.each do |page|
  page.raw_content.scan(/^\(\s(\d{6})\s+\d\s+(\w\d{4})\s+(0580|TT|1C|1C\s+1F)?\s+(\d+\/\d+\/\d+)\s+\d+\/\d+\/\d+\s+(\d+\.\d+)\s+(\d+\.\d+)/) do |line|        
    line.each { |x| x.strip! unless x.nil? }  # strip in place; skip unmatched (nil) groups
    print "#{line.join(' ')}\n"
    Cycle.check_details(line)
  end
end

Here is a sample of what puts page.raw_content produces (the lines contain a lot of whitespace).

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JAIME         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  887.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
   434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JOFFREY         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  259.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
   434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

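2 answers:

Answer 0 (score: 1)

Not everything is a candidate for parsing with a regular expression. And sometimes a regex is very useful only after the data has been broken into manageable chunks. Your data is an example of the second case. Once it has been broken down a bit, the individual lines are easy to parse.

Your data is confusing, but this will untangle it. After removing the leading ( and the trailing )' from each line, the code uses split to break the text into individual lines, then slice_before to group those lines into logical blocks. Once the blocks are gathered, each one can be processed in a sensible way: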

require 'pp'

data = "(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JAIME         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  887.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv_num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv_num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
  434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JOFFREY         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  259.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv_num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv_num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
  434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'
"

lines = data.gsub(/^\(|\)'$/, '')    # drop the leading "(" and trailing ")'" of each line
            .split("\n")
            .map(&:strip)
            .reject(&:empty?)
            .slice_before(/^REG\b/)  # start a new block at every "REG" header line

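At this point, lines behaves like an array of arrays. Each sub-array is a block of lines that begins with "REG": every time slice_before sees a new line matching /^REG\b/, it starts a new sub-array/block. Strictly speaking, lines is an Enumerator rather than an Array, that is, an intermediate object that produces the blocks when you iterate over it (call to_a on it if you need a real array). You can iterate over the enumerator directly, which is what we want to do.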
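To see what slice_before does in isolation, here is a minimal sketch on a made-up six-line array (the sample array and its values are hypothetical, not taken from the report):

```ruby
# Hypothetical miniature input: two records, each starting with a "REG" header.
sample = [
  'REG  LOC  NAME',
  'xx   xxx  LANNISTER, JAIME',
  '111111  1  65.60',
  'REG  LOC  NAME',
  'xx   xxx  LANNISTER, JOFFREY',
  '222222  1  41.00'
]

# A new block starts at every element that matches the pattern.
blocks = sample.slice_before(/^REG\b/)

puts blocks.class        # Enumerator
puts blocks.to_a.length  # 2

# Each block is a plain array holding the header and the lines that follow it.
blocks.each { |block| p block }
```

Iterating over the real lines enumerator works the same way: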

patient_data = lines.map { |sub_ary|
  # sub_ary[1] is the patient row: skip four columns, capture the name,
  # skip two more columns, then capture the rest as the special ID.
  sub_ary[1][/(?:\S+ \s+ ){4} (\S+, \s+ \S+) \s+ (?:\S+ \s+){2} (.+)$/x]
  patient_name, special_id = $1, $2

  # The blank "( )'" row was rejected earlier, so the column header is
  # sub_ary[3]; the invoice rows run from sub_ary[4] up to the two
  # CLAIM TOTAL rows at the end of the block.
  invoice_info = sub_ary[4..-3].map{ |line|
    line[/^(\S+) \s+ \S+ \s+ (\S+) \s+ (\S+)/x]
    [$1, $2, $3]
  }

  {
    patient_name: patient_name,
    special_id:   special_id,
    invoice_info: invoice_info
  }
}

pp patient_data

Which outputs:

[{:patient_name=>"LANNISTER, JAIME",
  :special_id=>"<special ID>",
  :invoice_info=>
  [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]},
{:patient_name=>"LANNISTER, JOFFREY",
  :special_id=>"<special ID>",
  :invoice_info=>
  [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]}]

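This gets you close, but it doesn't completely solve the problem. I deliberately left it to you to work out how to modify the code to grab all the fields you need from the records.

Answer 1 (score: 0)

If you want to test regular expressions, take a look at http://rubular.com/

It's a very useful tool, and the bottom of the page has a quick reference covering most of the regex basics.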
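As a nudge in that direction, the sketch below shows one way the invoice-row pattern might be extended with named captures to pick up the service date, units, and amount as well. The sample row, the group names, and the assumption that a modifier is one or two short alphanumeric tokens are all illustrative guesses, not taken from the real report:

```ruby
# Hypothetical invoice row, shaped like the sample rows once the
# surrounding "(" and ")'" have been stripped. All values are made up.
row = '123456       1    A1234  TT                    07/01/13  07/01/13     4.00     65.60'

# Named captures document which column each group belongs to.
invoice_row = %r{
  ^(?<invoice>\d+) \s+                              # invoice number
  \d+ \s+                                           # line number, skipped
  (?<service>\S+) \s+                               # procedure code
  (?:(?<modifier>[A-Z0-9]+(?:\s+[A-Z0-9]+)?) \s+)?  # optional modifier
  (?<from>\d+/\d+/\d+) \s+                          # FROM date
  \d+/\d+/\d+ \s+                                   # THRU date, skipped
  (?<units>\d+\.\d+) \s+
  (?<amount>\d+\.\d+)$
}x

if (m = invoice_row.match(row))
  # Pair each group name with the text it captured.
  fields = m.names.zip(m.captures).to_h
  p fields
end
```

Named captures keep the column meanings next to the pattern, and the optional modifier group comes back as nil when that column is blank.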