Rails regex pattern for home addresses

Time: 2013-10-06 03:07:34

Tags: ruby-on-rails regex

I need to parse some legal documents to find the addresses in them. Here is an example:

  

test = "9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY 10005 nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse 124 some ave 12 st, some city, NY, 10005cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed125 some ave 12 st, some city, NY, 10005 do eiusmod tempor incididunt ut labore 126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

tmp = test.scan(/(\d{3,6})(.*?)(\d{5})/)
tmp.each do |t|
  puts t.join()
end

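Typically an address starts with a number and ends with a zip code, but that is not always the case in these documents.

The problem is that I miss some addresses and get some unwanted results, such as: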

9999 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris 123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum 11111

What I want is an array of the following 4 items:

123 some ave 12 st, some city, NY, 10005
124 some ave 12 st, some city, NY, 10005
125 some ave 12 st, some city, NY, 10005
126 SOMETHING SOMETHING, SOME CITY, NEW YORK

As for the last item, I am fairly sure that every formatted address will end with "New York" or "NY".

I think the pattern I am aiming for is:

/(ANY DIGITS BETWEEN 3 AND 6)(AT LEAST 3 WORDS BUT NOT MORE THAN 10)((TRY FIRST ZIPCODE)|(IF NO ZIP CODE THEN TRY "NEW YORK" OR "NY"))/i

Any help is greatly appreciated.

2 Answers:

Answer 0: (score: 1)

Here is how I approach parsing information out of legal text:

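  1. Break the complex task into simpler ones. Write a regex (or a function that uses a regex) for each address variant you want to capture.

  2. Write test cases for each variant. Here are some tests I wrote for a number parser, as an example:

        test '554' do
          assert_equal 554, number_parser.parse('five hundred fifty-four')
        end

        test '1301' do
          assert_equal 1301, number_parser.parse('thirteen hundred one')
        end

  3. Since you know the range of certain values (for example, state names and state abbreviations), you can build that knowledge into the functions that parse each variant.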
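The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `STATES`, `WITH_ZIP`, `STATE_ONLY`, and `find_addresses` are hypothetical, the state list is cut down to the two values mentioned in the question, and the 10-word cap comes from the asker's pseudo-pattern. The idea is one small pattern per variant, with the known state values shared between them.

```ruby
# Hypothetical sketch: one regex per address variant, sharing a list of known states.
# The state list is deliberately truncated to the values from the question.
STATES = ['NEW YORK', 'NY'].freeze
STATE_RE = Regexp.union(STATES)

# Variant 1: house number, 1-10 words, state, optional comma, 5-digit zip code.
# (?<!\d) keeps a match from starting in the middle of a longer number.
WITH_ZIP = /(?<!\d)\d{3,6}(?:\s+\S+){1,10}?\s+#{STATE_RE},?\s*\d{5}/

# Variant 2: same start, but the address ends at the state name.
STATE_ONLY = /(?<!\d)\d{3,6}(?:\s+\S+){1,10}?\s+#{STATE_RE}\b/

# Try the zip-code variant first, then fall back to the state-only variant.
def find_addresses(text)
  text.scan(Regexp.union(WITH_ZIP, STATE_ONLY))
end

text = 'laboris 123 some ave 12 st, some city, NY 10005 nisi ut ' \
       '126 SOMETHING SOMETHING, SOME CITY, NEW YORK et dolore'
puts find_addresses(text)
# 123 some ave 12 st, some city, NY 10005
# 126 SOMETHING SOMETHING, SOME CITY, NEW YORK
```

Each variant can then get its own test cases, in the same style as the number parser tests above. The word cap is what keeps a stray number far from any address (such as the leading 9999 in the sample) from starting a match.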

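Answer 1: (score: 0)

As michaelmichael and stackoverflow.com/questions/9397485/regex-street-address-match point out, there is really no way to scan for addresses correctly, and even less so when, as the original sample shows, the documents contain a lot of typos.

So I split it into two parts.

First, a function that scans for patterns that resemble an address.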

# First scan for possible addresses
def look_for_address_patterns(txt)
  resp = []
  # Look for a number that is 2-6 digits long (similar to a house number),
  # followed by the next 1-15 whitespace-separated items,
  # ending with either 5 digits (zip code) or the state name / abbreviation.
  # The inner group is non-capturing so that join() below does not repeat the last item.
  scan = txt.scan(/(\d{2,6})(\s*(?:\S+\s+){1,15})((?:\d{5})|(?:NEW YORK|NY))/)
  scan.each do |s|
    resp.push s.join()
  end
  # Go to step 2 for verifying address before returning anything
  verify_address(resp)
end
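The function above joins whatever `scan` returns for each match, so it helps to know how `String#scan` treats capture groups: with groups in the pattern it returns one array of group values per match, and a repeated capturing group keeps only its last repetition. A small sketch with a made-up string (the variable names are only for illustration):

```ruby
sample = '123 some ave, NY'

# Capture groups present: one array of group values per match.
with_groups = sample.scan(/(\d+)((?:\s+\S+)+)/)
p with_groups     # [["123", " some ave, NY"]]

# No capture groups: the matched strings themselves.
without_groups = sample.scan(/\d+(?:\s+\S+)+/)
p without_groups  # ["123 some ave, NY"]

# A repeated capturing group only remembers its final repetition.
repeated = sample.scan(/(\d+)(\s+\S+)+/)
p repeated        # [["123", " NY"]]
```

This is why a repeated group nested inside another capturing group should usually be non-capturing when the groups are joined back together: otherwise the last repetition shows up twice.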

Next, we use a service such as Google, MapQuest, or Yahoo to verify the addresses.

require 'json'
require 'open-uri'
require 'cgi'

def verify_address(arry)
  verified = []
  arry.each do |addr|
    # escape the address so spaces and commas are safe in the query string
    url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + CGI.escape(addr)
    response = JSON.parse(URI.open(url).read)
    # 'results' is an array; take the first match and skip the address if nothing came back
    result = (response['results'] || []).first
    next if result.nil?
    formatted = result['formatted_address']
    # compare that we got something similar in address response, remove SW and from Lane to ln is ok, but anything else is probably a different address
    matched = addr.downcase[0..8] == formatted.downcase[0..8]
    # should be storing more info like lat / lng but that is for a later project
    verified.push(formatted) if matched
  end
  verified
end
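The comparison step can be tried without calling any geocoding service. The sketch below is an assumption on my part rather than part of the original answer: `similar_prefix?` is a hypothetical helper that normalizes both strings (lowercase, punctuation stripped, whitespace collapsed) before comparing the first 9 characters, the same prefix length the function above uses.

```ruby
# Hypothetical helper: compare the normalized first `length` characters of two addresses.
def similar_prefix?(candidate, formatted, length = 9)
  # lowercase, drop everything except letters, digits and spaces, collapse repeated spaces
  normalize = ->(s) { s.downcase.gsub(/[^a-z0-9 ]/, '').squeeze(' ').strip }
  normalize.call(candidate)[0, length] == normalize.call(formatted)[0, length]
end

puts similar_prefix?('123 Some Ave 12 st, some city, NY 10005',
                     '123 some ave, Some City, NY 10005, USA')  # true
puts similar_prefix?('124 some ave 12 st, some city, NY 10005',
                     '421 Other Rd, Some City, NY 10005, USA')  # false
```

A fixed-length prefix is a crude similarity test; it is kept here only because it mirrors the check in the answer.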

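That is what I have so far. The first part works quite well, but it produces both false positives and false negatives (in some cases it misses an address entirely). The second part helps weed out the false positives and returns better-formatted addresses (addresses in legal documents are not always in the best format).

The result captures 85% of all addresses in the documents, which is acceptable for my project. I am sure this could be improved with some fine-tuning, so regex gurus, please feel free to chime in.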