Regex to match inconsistent address patterns in a large .dat file

Date: 2013-06-14 18:05:46

Tags: regex street-address

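I know this can never be perfect, but I'm not very good with regular expressions, and I'm struggling to get a better match percentage.

I have a file with over 9 million lines, and the addresses in it are very inconsistent. I'm hoping someone here who is better at this than I am can help. Any help would be greatly appreciated.

Here is what I have so far. I think the best way to attack this is to match the pattern from the end of the string, since APT, BX, PO BOX, and so on can appear at the beginning of the string.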

/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/
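One way to put a number on the "match percentage" is to run a pattern over a sample of lines and count the hits. The sketch below is only an illustration in Python: the function name `match_rate` and the two sample lines are made up, and the pattern is the one above with the `/.../` delimiters dropped.

```python
import re

# The question's pattern, unchanged apart from removing the / delimiters.
ADDRESS_RE = re.compile(
    r"(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+"
    r"|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)"
    r"\s(.+)\s{2,}(\D+)\s(\D{2})$"
)


def match_rate(lines, pattern=ADDRESS_RE):
    """Return the fraction of lines (0.0 to 1.0) that the pattern matches."""
    total = 0
    matched = 0
    for line in lines:
        total += 1
        # Strip the trailing newline so that $ anchors on the real line end.
        if pattern.search(line.rstrip("\r\n")):
            matched += 1
    return matched / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = ["123 MAIN ST    SPRINGFIELD IL", "no address here"]
    print(f"{match_rate(sample):.0%} of lines matched")
```

Because `match_rate` accepts any iterable of strings, an open file object can be passed in directly, so a 9-million-line file never has to be loaded into memory at once.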

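I can see several patterns. The large runs of spaces appear exactly as they do in the file. I have tried splitting on 2 or more spaces as well as the regex above.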

F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS ZIP         CITY STATE

ADDRESS        CITY STATE

ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY STATE

APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY STATE

P O BOX #             ADDRESS        CITY STATE

APT DIGIT#         ADDRESS CITY STATE 

SPACE DIGIT    ADDRESS      CITY STATE

UNIT #         ADDRESS     CITY STATE

SP DIGIT          ADDRESS      CITY STATE

DIGITS-DIGITS ADDRESS       CITY STATE

BX DIGIT       ADDRESS         CITY STATE

ADDRESS     APT #      CITY STATE

ADDRESS       UNIT #     CITY STATE

ADDRESS   P O BOX   DIGIT     CITY STATE

P O B O X    DIGIT      CITY STATE

P O BOX DIGIT    CITY      STATE

ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY STATE
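The idea of splitting on runs of two or more spaces can be sketched roughly as follows. This is a Python illustration, not the asker's code: `split_address_line` is a hypothetical helper, and it assumes the state is always the last whitespace-separated token on the line.

```python
import re


def split_address_line(line):
    """Split a line into (leading_fields, city, state), or None if unclear."""
    # Runs of 2+ whitespace characters are treated as field separators.
    fields = re.split(r"\s{2,}", line.strip())
    last = fields[-1]
    if " " in last:
        # Usual case: "CITY STATE" share the final field.
        city, state = last.rsplit(None, 1)
        rest = fields[:-1]
    elif len(fields) >= 2:
        # State stands alone, as in "P O BOX DIGIT    CITY      STATE".
        city, state = fields[-2], last
        rest = fields[:-2]
    else:
        return None  # Not enough structure to guess.
    return rest, city, state


if __name__ == "__main__":
    print(split_address_line("ADDRESS        CITY STATE"))
    # (['ADDRESS'], 'CITY', 'STATE')
    print(split_address_line("P O BOX DIGIT    CITY      STATE"))
    # (['P O BOX DIGIT'], 'CITY', 'STATE')
```

Lines where only a single space separates the address from the city, such as `APT DIGIT#         ADDRESS CITY STATE`, stay ambiguous under this approach: the address ends up inside the city field.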

2 answers:

Answer 0: (score: 4)

This is a fairly complex problem, and unfortunately there is no simple solution.

You could try the regex below, which is far from perfect:

^.*?(?<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?(?<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)\s{2,}(?<city>\b(?:\w|\s(?=\S))+\b)\s{1,}(?<state>\b\w{2,3}\b)(?:$|\r|\n)

The capture groups are: group 1 = address; group 2 = zip; group 3 = city; group 4 = state.

Input. Note that I changed STATE to st, ZIP to 12345, and the PO box digits to actual numbers:

F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st
ADDRESS        CITY st
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
P O BOX # 1234            ADDRESS        CITY st
APT DIGIT#         ADDRESS CITY st
SPACE DIGIT    ADDRESS      CITY st
UNIT #         ADDRESS     CITY st
SP DIGIT          ADDRESS      CITY st
DIGITS-DIGITS ADDRESS       CITY st
BX DIGIT       ADDRESS         CITY st
ADDRESS     APT #      CITY st
ADDRESS       UNIT #     CITY st
ADDRESS   P O BOX   3245     CITY st
P O B O X    123      CITY st
P O BOX 345    CITY      st
ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY st

Matches

[0] => Array
(
    [0] => F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st
    [1] => ADDRESS        CITY st
    [2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
    [3] => APT #               ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S       CITY st
    [4] => P O BOX # 1234            ADDRESS        CITY st
    [5] => APT DIGIT#         ADDRESS CITY st
    [6] => SPACE DIGIT    ADDRESS      CITY st
    [7] => UNIT #         ADDRESS     CITY st
    [8] => SP DIGIT          ADDRESS      CITY st
    [9] => DIGITS-DIGITS ADDRESS       CITY st
    [10] => BX DIGIT       ADDRESS         CITY st
    [11] => ADDRESS     APT #      CITY st
    [12] => ADDRESS       UNIT #     CITY st
    [13] => ADDRESS   P O BOX   DIGIT     CITY st
    [14] => P O B O X    123      CITY st
    [15] => P O BOX 345    CITY      st
    [16] => ADDRESS    SPACE/SP/SPC/UNIT DIGIT     CITY st
)

[address] => Array
(
    [0] => ADDRESS 12345
    [1] => ADDRESS
    [2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
    [3] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
    [4] => ADDRESS
    [5] => APT DIGIT#
    [6] => ADDRESS
    [7] => ADDRESS
    [8] => ADDRESS
    [9] => DIGITS-DIGITS ADDRESS
    [10] => ADDRESS
    [11] => APT #
    [12] => UNIT #
    [13] => DIGIT
    [14] => 123
    [15] => P O BOX 345
    [16] => SPACE/SP/SPC/UNIT DIGIT
)

[zip] => Array
(
    [0] => 12345
    [1] => 
    [2] => 
    [3] => 
    [4] => 
    [5] => 
    [6] => 
    [7] => 
    [8] => 
    [9] => 
    [10] => 
    [11] => 
    [12] => 
    [13] => 
    [14] => 
    [15] => 
    [16] => 
)

[city] => Array
(
    [0] => CITY
    [1] => CITY
    [2] => CITY
    [3] => CITY
    [4] => CITY
    [5] => ADDRESS CITY
    [6] => CITY
    [7] => CITY
    [8] => CITY
    [9] => CITY
    [10] => CITY
    [11] => CITY
    [12] => CITY
    [13] => CITY
    [14] => CITY
    [15] => CITY
    [16] => CITY
)


[state] => Array
(
    [0] => st
    [1] => st
    [2] => st
    [3] => st
    [4] => st
    [5] => st
    [6] => st
    [7] => st
    [8] => st
    [9] => st
    [10] => st
    [11] => st
    [12] => st
    [13] => st
    [14] => st
    [15] => st
    [16] => st
)
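For readers who want to try the pattern outside a PCRE-style engine, here is a minimal Python sketch. It assumes the only change needed is Python's `(?P<name>...)` spelling for named groups; `parse_line` is a hypothetical helper name. It is exercised on just two of the sample rows. Rows containing `#` may behave differently, since `\b` does not match between `#` and a space.

```python
import re

# Same pattern as in the answer, with (?<name>...) rewritten as (?P<name>...).
LINE_RE = re.compile(
    r"^.*?(?P<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?"
    r"(?P<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)"
    r"\s{2,}(?P<city>\b(?:\w|\s(?=\S))+\b)"
    r"\s{1,}(?P<state>\b\w{2,3}\b)(?:$|\r|\n)"
)


def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = LINE_RE.match(line)
    # Groups that did not take part in the match (e.g. zip) come back as None.
    return m.groupdict() if m else None


if __name__ == "__main__":
    print(parse_line("ADDRESS        CITY st"))
    # {'address': 'ADDRESS', 'zip': None, 'city': 'CITY', 'state': 'st'}
    print(parse_line(
        "F_NAME L_NAMEFOR F_NAME L_NAME          ADDRESS 12345         CITY st"
    ))
    # {'address': 'ADDRESS 12345', 'zip': '12345', 'city': 'CITY', 'state': 'st'}
```

Matching one line at a time avoids needing a multiline flag for `^`, and it fits naturally with streaming a very large file.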

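I'd also suggest looking at question 11160192.

Answer 1: (score: 0)

Denomales' answer is probably sufficient here, but I wanted to expand my comment above into an answer, because I think parts of it are relevant to your question.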

Are they US addresses? You could try an API or tool to extract the addresses in bulk. Another recent Stack Overflow answer showed an example of such a tool, applied to a small list of addresses to match.

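Full disclosure: I work at SmartyStreets and helped develop this. While it wasn't designed specifically for spreadsheet or tabular address data, it was designed for non-uniform input such as free-form text. You could even pipe millions of rows into the service.

This might be useful because it also verifies the addresses after finding them in the text. As you have discovered, addresses are pretty messy, and a dedicated tool can sometimes be the best way to handle them. I'm not saying this is the right answer for your case, but hopefully it's informative all the same.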