Ruby code to intelligently extract data from irregular text

Asked: 2013-11-28 15:44:57

Tags: ruby parsing extract text-parsing

I am trying to write Ruby code that extracts data at specific positions from irregular text content.

Below is the text I am looking at.

                 Address1                                                   Address2                                       

 adress1, adress1, # 34 , adress1, 
 4th Floor, Plot # 14 & 15, 
 Drive,,                                                               HARIKA BHIMANI

 Madhapur, Hyderabad - 500081                                          2-14-117/35-1 Nas                   
 Andhra Pradesh                                                        AP                                                 
 +(91)40-00000000
 xyz@dabc.com

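This is my odd text, and I want to extract Address1 and Address2 separately. I thought of splitting, but I could not work out how to pull out Address1 and Address2 separately, because both sit on the same line. The gap between the Address1 content and the Address2 content is always at least 2 spaces.

I plan to parse each line and split it on runs of multiple spaces. How do I split a string in Ruby on a delimiter of two or more spaces?

We can ignore the first two lines of the text above and start from line 3. Basically I want to separate the left-hand data from the right-hand data; the delimiter is two or more spaces. I have edited the question to include my sample code, but it fails when the left-hand side of a line is empty.

I tried the following: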

unless line.empty?
  splits = line.split(/ {2,}/)

  case splits.length
  when 2
    puts "Address1 " + splits[1]
  when 3
    puts "Address1 " + splits[1]
    puts "Address2 " + splits[2]
  end
end

But it fails for the following example:
   leftSideHasData                    rightSideHasData
                                   OnlyRightSideHasData
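To illustrate the failure, here is a minimal sketch of what `split` returns for these two lines. The strings are made-up stand-ins mirroring the example above, and the variable names are only for illustration. Because both lines begin with a run of spaces, `split` yields a leading empty string, so a right-only line produces the same element count as a left-only line would.

```ruby
# Hypothetical sample lines mirroring the failing example above.
both_sides = "   leftSideHasData                    rightSideHasData"
right_only = "                                   OnlyRightSideHasData"

both_parts  = both_sides.split(/ {2,}/)
right_parts = right_only.split(/ {2,}/)

p both_parts   # => ["", "leftSideHasData", "rightSideHasData"]
p right_parts  # => ["", "OnlyRightSideHasData"]
```

With a length of 2, the `when 2` branch prints "OnlyRightSideHasData" as Address1, even though it belongs to Address2.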

How can I achieve this in Ruby? Does Ruby provide any APIs to do this easily?

1 Answer:

Answer 0: (score: 0)

text = <<TEXT
                 Address1                                                   Address2
 adress1, adress1, # 34 , adress1, 
 4th Floor, Plot # 14 & 15, 
 Drive,,                                                               HARIKA BHIMANI
 Madhapur, Hyderabad - 500081                                          2-14-117/35-1 Nas
 Andhra Pradesh                                                        AP
 +(91)40-00000000
 xyz@dabc.com
TEXT

# Split each line into columns on runs of two or more whitespace characters.
rows = text.split("\n").map { |row| row.split(/\s{2,}/) }

address1, address2 = [], []
rows.each { |row| address1 << row[0]; address2 << row[1] }

address1
=> ["",
" adress1, adress1, # 34 , adress1, ",
" 4th Floor, Plot # 14 & 15, ",
" Drive,,",
" Madhapur, Hyderabad - 500081",
" Andhra Pradesh",
" +(91)40-00000000",
" xyz@dabc.com"]


address2
=> ["Address1", nil, nil, "HARIKA BHIMANI", "2-14-117/35-1 Nas", "AP", nil, nil]

You can remove the nils, for example with Array#compact.
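When the left-hand side of a line can be empty, splitting alone cannot tell which column a piece belongs to. One possible workaround is to classify each chunk by the offset at which it starts. The sketch below assumes that right-hand data always begins at or beyond a known column; `RIGHT_COLUMN_MIN = 40` is a made-up threshold and `split_columns` is a hypothetical helper name, neither of which comes from the question.

```ruby
# Assumed column at or beyond which a chunk counts as right-hand data.
# This value is a guess for illustration; tune it to the real input.
RIGHT_COLUMN_MIN = 40

# Returns [left, right] for one line; a side with no data is nil.
def split_columns(line, right_min = RIGHT_COLUMN_MIN)
  left, right = [], []
  # A chunk is a run of words separated by single spaces only.
  line.scan(/\S+(?: \S+)*/) do |chunk|
    start = Regexp.last_match.begin(0)
    (start >= right_min ? right : left) << chunk
  end
  [left.join(" "), right.join(" ")].map { |s| s.empty? ? nil : s }
end

# Sample lines built with ljust so the column positions are explicit.
lines = [
  " Drive,,".ljust(70) + "HARIKA BHIMANI",
  " Andhra Pradesh".ljust(70) + "AP",
  " +(91)40-00000000",
  " " * 70 + "OnlyRightSideHasData"
]

address1, address2 = lines.map { |l| split_columns(l) }.transpose

p address1.compact  # => ["Drive,,", "Andhra Pradesh", "+(91)40-00000000"]
p address2.compact  # => ["HARIKA BHIMANI", "AP", "OnlyRightSideHasData"]
```

The threshold only needs to fall somewhere between the start of the left-hand text and the start of the right-hand text. A left-hand chunk that runs past the threshold still counts as left, because only its starting offset is checked.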