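I'm using the Prawn gem to read a 60-page computer-generated PDF report containing financial and demographic data for dozens of people. My challenge is that I want to capture the name/special ID (which are on the same line) along with the subsequent lines associated with that person as I scan each line. Using Ruby's String#scan method, I've been able to capture the financial fields, with each match returning a line in this form: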
[<invoice no.>, <service type>, <modifier (if any)>, <service_date>, <units>, <amount>]
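I've tried to associate the ID with the financial data a few lines below it, and then switch when the ID changes, but nothing has worked. Am I going about this ass-backwards? I have very little experience with regular expressions (and with programming in general).
Here is the code that works for the financial data only: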
PDF::Reader.new(file).pages.each do |page|
  page.raw_content.scan(/^\(\s(\d{6})\s+\d\s+(\w\d{4})\s+(0580|TT|1C|1C\s+1F)?\s+(\d+\/\d+\/\d+)\s+\d+\/\d+\/\d+\s+(\d+\.\d+)\s+(\d+\.\d+)/) do |line|
    line.collect {|x| x.strip! if !x.nil?}
    print "#{line.join(' ')}\n"
    Cycle.check_details(line)
  end
end
Here is a sample of what `puts page.raw_content` produces (the lines contain a lot of blank space):
(REG LOC CLIENT SERVICE NAME BIRTH DATE RECIPIENT ID PRIOR AUTHORIZATION #)'
(xx xxx xxxxx xxxxxxx LANNISTER, JAIME xx/xx/xxxx xxxx <special ID>)'
(DIAGNOSIS CODES: 887.0)'
( )'
( INV # LINE # PROCEDURE CODE REVENUE CD FROM DT THRU DT UNITS AMOUNT)'
( <inv num> 1 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 2 <service_code> <modifier> xx/xx/13 xx/xx/13 2.50 41.00)'
( <inv num> 3 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 4 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 5 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 6 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 7 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( CLAIM TOTAL
434.60 CLAIM ACCOUNT REF. xxxxxxxxxxxxxxxSUP)'
(REG LOC CLIENT SERVICE NAME BIRTH DATE RECIPIENT ID PRIOR AUTHORIZATION #)'
(xx xxx xxxxx xxxxxxx LANNISTER, JOFFREY xx/xx/xxxx xxxx <special ID>)'
(DIAGNOSIS CODES: 259.0)'
( )'
( INV # LINE # PROCEDURE CODE REVENUE CD FROM DT THRU DT UNITS AMOUNT)'
( <inv num> 1 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 2 <service_code> <modifier> xx/xx/13 xx/xx/13 2.50 41.00)'
( <inv num> 3 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 4 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 5 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 6 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 7 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( CLAIM TOTAL
434.60 CLAIM ACCOUNT REF. xxxxxxxxxxxxxxxSUP)'
Answer 0 (score: 1)
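Not everything is a candidate for parsing with a regular expression. Sometimes regular expressions only become useful after the data has been broken into manageable chunks, and your data is an example of that second case. Once it has been broken down a bit, the individual lines are easy to parse.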
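Your data is obfuscated, but this untangles it. After stripping the leading `(` and trailing `)'` from each line, the code uses `split` to break the text into separate lines, then uses `slice_before` to group them into logical blocks. Once those are collected, each block can be processed in a sensible way: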
require 'pp' # pp lives in the 'pp' library; 'prettyprint' only defines the PrettyPrint class
data = "(REG LOC CLIENT SERVICE NAME BIRTH DATE RECIPIENT ID PRIOR AUTHORIZATION #)'
(xx xxx xxxxx xxxxxxx LANNISTER, JAIME xx/xx/xxxx xxxx <special ID>)'
(DIAGNOSIS CODES: 887.0)'
( )'
( INV # LINE # PROCEDURE CODE REVENUE CD FROM DT THRU DT UNITS AMOUNT)'
( <inv num> 1 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 2 <service_code> <modifier> xx/xx/13 xx/xx/13 2.50 41.00)'
( <inv num> 3 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 4 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 5 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 6 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 7 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( CLAIM TOTAL
434.60 CLAIM ACCOUNT REF. xxxxxxxxxxxxxxxSUP)'
(REG LOC CLIENT SERVICE NAME BIRTH DATE RECIPIENT ID PRIOR AUTHORIZATION #)'
(xx xxx xxxxx xxxxxxx LANNISTER, JOFFREY xx/xx/xxxx xxxx <special ID>)'
(DIAGNOSIS CODES: 259.0)'
( )'
( INV # LINE # PROCEDURE CODE REVENUE CD FROM DT THRU DT UNITS AMOUNT)'
( <inv num> 1 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 2 <service_code> <modifier> xx/xx/13 xx/xx/13 2.50 41.00)'
( <inv num> 3 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 4 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 5 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 6 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( <inv num> 7 <service_code> <modifier> xx/xx/13 xx/xx/13 4.00 65.60)'
( CLAIM TOTAL
434.60 CLAIM ACCOUNT REF. xxxxxxxxxxxxxxxSUP)'
"
lines = data.gsub(/^\(|\)'$/m, '').split("\n").map{ |s| s.strip }.reject{ |s| s.empty? }.slice_before(/^REG\b/)
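To see what `slice_before` does in isolation, here is a minimal sketch on a toy array. The strings are made up for illustration and are not taken from the report; only the `/^REG\b/` pattern is the one used above.

```ruby
# Toy input: every element starting with "REG" begins a new record.
rows = ['REG a', 'x 1', 'x 2', 'REG b', 'y 1']

# slice_before starts a new chunk each time an element matches the pattern.
chunks = rows.slice_before(/^REG\b/)

p chunks.class # Enumerator
p chunks.to_a  # [["REG a", "x 1", "x 2"], ["REG b", "y 1"]]
```

Nothing is grouped until the enumerator is consumed, for example by `to_a`, `each` or `map`.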
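After the `slice_before` call, `lines` behaves like an array of arrays. Each sub-array is a block of lines that starts with "REG": every time `slice_before` sees a new line matching `/^REG\b/`, it starts a new sub-array/block. Strictly speaking, `lines` is an enumerator, a lazy precursor that yields those sub-arrays on demand (much like iterating a hash before you pull out an array or individual key/value pairs). You can iterate over an enumerator, which is what we want to do: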
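patient_data = lines.map { |sub_ary|
  # Row 1 of each block is the patient row: skip four columns, grab "LAST, FIRST",
  # skip the birth date and recipient ID, and take the rest as the special ID.
  sub_ary[1][/(?:\S+ \s+ ){4} (\S+, \s+ \S+) \s+ (?:\S+ \s+){2} (.+)$/x]
  patient_name, special_id = $1, $2
  # Rows 0-3 are the REG header, patient row, diagnosis row and INV header (the blank
  # row was rejected earlier); the last two rows are the claim total.
  invoice_info = sub_ary[4..-3].map{ |line|
    line[/^(\S+) \s+ \S+ \s+ (\S+) \s+ (\S+)/x]
    [$1, $2, $3]
  }
  {
    patient_name: patient_name,
    special_id: special_id,
    invoice_info: invoice_info
  }
}
pp patient_data
Which outputs the following, assuming each `<inv num>` placeholder is a single whitespace-free token such as `<inv_num>` (as a real invoice number would be):
[{:patient_name=>"LANNISTER, JAIME",
  :special_id=>"<special ID>",
  :invoice_info=>
   [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]},
 {:patient_name=>"LANNISTER, JOFFREY",
  :special_id=>"<special ID>",
  :invoice_info=>
   [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]}]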
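This gets you close but doesn't completely solve the problem. I deliberately left it to you to work out how to modify the code to grab all the fields you need from each record.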
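As one possible way to pull out the six fields listed in the question (invoice no., service type, modifier, service date, units, amount), here is a sketch using named captures. The sample lines and their values are hypothetical, since the real report is masked, and the assumption that the modifier is optional and one or two tokens long is inferred from the regex in the question, not confirmed by the data.

```ruby
# Hypothetical invoice lines, already stripped of the leading "(" and trailing ")'".
sample_lines = [
  '123456 1 T1019 TT 01/02/13 01/02/13 4.00 65.60',
  '123456 2 T1019 01/03/13 01/03/13 2.50 41.00' # no modifier on this one
]

# Extended mode (x) ignores the whitespace and allows the trailing comments.
invoice_line = %r{
  ^(?<invoice_no>\d+)             \s+  # invoice number
  \d+                             \s+  # line number, not captured
  (?<service_type>\S+)            \s+  # procedure code
  (?:(?<modifier>\S+(?:\s+\S+)?)  \s+)? # optional modifier, one or two tokens
  (?<service_date>\d+/\d+/\d+)    \s+  # FROM DT
  \d+/\d+/\d+                     \s+  # THRU DT, not captured
  (?<units>\d+\.\d+)              \s+  # units
  (?<amount>\d+\.\d+)                  # amount
}x

records = sample_lines.map { |line|
  m = invoice_line.match(line)
  next unless m # skip anything that is not an invoice line

  [m[:invoice_no], m[:service_type], m[:modifier], m[:service_date],
   m[:units].to_f, m[:amount].to_f]
}.compact

records.each { |r| p r }
```

The same pattern could replace the three-field regex inside the `invoice_info` block; a missing modifier comes back as `nil` instead of shifting the other columns.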
Answer 1 (score: 0)
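If you want to test a regular expression, take a look at http://rubular.com/
It's a very useful tool, and the bottom of the page has a quick reference covering most of the regex basics.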
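The same kind of check can also be done locally, for example in irb. This is a small sketch that tries the date pattern from the question against a made-up string; the test string is an assumption for illustration only.

```ruby
# The date pattern from the question, with each part captured separately.
date_re = %r{(\d+)/(\d+)/(\d+)}

m = date_re.match('xx 01/02/13 4.00')

p m[0]        # the whole match: "01/02/13"
p m.captures  # the three groups: ["01", "02", "13"]
p m.pre_match # the text before the match: "xx "
```

`match` returns `nil` when the pattern does not match, so checking the result for `nil` before indexing is a quick way to spot a pattern that is too strict.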