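I know this can't be perfect, but I'm not very good with regular expressions, and I'm having a hard time getting a better match rate.
I have a file with over 9 million lines, and the addresses are very inconsistent. I was wondering if I could get some help from people here who are better at this than I am. Any help would be greatly appreciated.
Here is what I have so far. I figured the best way to attack this is to match the pattern from the end of the string, since APT, BX, PO BOX, etc. can appear at the beginning of the string.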
/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/
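One way to keep the main pattern shorter is to pull the unit designator (APT, UNIT, BX, PO BOX and so on) out of the address in a separate pass, wherever it appears. The following is only a rough sketch in Python (the question does not name a language, so that choice is an assumption); `UNIT_RE` and `split_unit` are hypothetical names, and the keyword list is taken from the alternation above rather than from any official list. It only recognises designators followed by a number, so letter-only units such as `APT B` are left untouched.

```python
import re

# Hypothetical helper: keywords mirror the alternation in the question's regex.
# "P O B O X", "P O BOX" and "POBOX" are all accepted by the last alternative.
UNIT_RE = re.compile(
    r"\b(?P<kind>APT|UNIT|SPACE|SPC|SP|BX|P\s?O\s?B\s?O\s?X)"
    r"\s*#?\s*(?P<number>[A-Z]?\d+[A-Z]?)\b",
    re.IGNORECASE,
)

def split_unit(address):
    """Return (address_without_unit, unit) where unit is None if not found."""
    m = UNIT_RE.search(address)
    if m is None:
        return address.strip(), None
    # Normalise the keyword: drop inner spaces, upper-case it.
    kind = re.sub(r"\s+", "", m.group("kind")).upper()
    # Remove the matched span and collapse the leftover whitespace.
    rest = " ".join((address[:m.start()] + " " + address[m.end():]).split())
    return rest, f"{kind} {m.group('number').upper()}"
```

For example, `split_unit("APT 4 123 MAIN ST")` separates `APT 4` from `123 MAIN ST` regardless of whether the designator leads or trails the street part.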
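I can see several patterns. The large runs of spaces are as they appear in the file. So far I have tried splitting on 2 or more spaces as well as the regex above.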
F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS ZIP CITY STATE
ADDRESS CITY STATE
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
P O BOX # ADDRESS CITY STATE
APT DIGIT# ADDRESS CITY STATE
SPACE DIGIT ADDRESS CITY STATE
UNIT # ADDRESS CITY STATE
SP DIGIT ADDRESS CITY STATE
DIGITS-DIGITS ADDRESS CITY STATE
BX DIGIT ADDRESS CITY STATE
ADDRESS APT # CITY STATE
ADDRESS UNIT # CITY STATE
ADDRESS P O BOX DIGIT CITY STATE
P O B O X DIGIT CITY STATE
P O BOX DIGIT CITY STATE
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY STATE
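The "split on 2 or more spaces" idea can be sketched as below. This is a minimal Python illustration under stated assumptions, not a tested solution for the real file: it assumes the city and state sit in the last wide-gap-delimited field, separated from each other by ordinary whitespace, and that the state is two characters. `split_from_right` is a hypothetical name.

```python
import re

GAP = re.compile(r"\s{2,}")  # a "wide gap": two or more whitespace characters

def split_from_right(line):
    """Split a line on wide gaps, then peel city and state off the last field."""
    fields = GAP.split(line.strip())
    if len(fields) < 2:
        return None  # no wide gap, so this layout does not apply
    tail = fields[-1].rsplit(None, 1)  # split off the last word of the field
    if len(tail) != 2 or len(tail[1]) != 2:
        return None  # last word is not a two-character state
    city, state = tail
    return {"leading": fields[:-1], "city": city, "state": state}
```

Counting how many lines return `None` gives a quick measure of how much of the file does not follow this layout, before any heavier regex is applied to the `leading` fields.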
Answer 0 (score: 4)
This is a fairly complicated problem, and unfortunately there is no simple solution.
You could try the regex below, which is far from perfect:
^.*?(?<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?(?<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)\s{2,}(?<city>\b(?:\w|\s(?=\S))+\b)\s{1,}(?<state>\b\w{2,3}\b)(?:$|\r|\n)
In the image, group 1 = address; group 2 = zip; group 3 = city; group 4 = state.
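If the pattern is ported to Python, note that Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`. The sketch below is a deliberately simplified variant, not a translation of the regex above: it assumes a gap of two or more spaces before the city, a city made of letters and dots, and a two-letter state at the end of the line. `LINE_RE`, `ZIP_RE` and `parse_line` are hypothetical names. A trailing five-digit number is treated as a ZIP, so a five-digit PO box number at the end of the address would be misread.

```python
import re

LINE_RE = re.compile(
    r"^(?P<address>.*?)"                      # lazy: everything before the gap
    r"\s{2,}"                                 # two or more spaces before city
    r"(?P<city>[A-Za-z.]+(?: [A-Za-z.]+)*)"   # words joined by single spaces
    r"\s+(?P<state>[A-Za-z]{2})\s*$"          # two-letter state ends the line
)
# Optional ZIP (5 digits, optionally +4) at the very end of the address part.
ZIP_RE = re.compile(r"\s*\b(?P<zip>\d{5}(?:-\d{4})?)$")

def parse_line(line):
    """Return a dict of address, zip, city, state, or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    address = m.group("address").strip()
    zip_code = None
    z = ZIP_RE.search(address)
    if z:
        zip_code = z.group("zip")
        address = address[:z.start()].rstrip()
    return {
        "address": address,
        "zip": zip_code,
        "city": m.group("city"),
        "state": m.group("state"),
    }
```

Lines without a wide gap return `None`, which makes it easy to collect the non-matching lines and inspect them separately.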
Input. Note that I changed STATE to st, ZIP to 12345, and the PO box DIGIT placeholders to actual numbers:
F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS 12345 CITY st
ADDRESS CITY st
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
P O BOX # 1234 ADDRESS CITY st
APT DIGIT# ADDRESS CITY st
SPACE DIGIT ADDRESS CITY st
UNIT # ADDRESS CITY st
SP DIGIT ADDRESS CITY st
DIGITS-DIGITS ADDRESS CITY st
BX DIGIT ADDRESS CITY st
ADDRESS APT # CITY st
ADDRESS UNIT # CITY st
ADDRESS P O BOX 3245 CITY st
P O B O X 123 CITY st
P O BOX 345 CITY st
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY st
Matches
[0] => Array
(
[0] => F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS 12345 CITY st
[1] => ADDRESS CITY st
[2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
[3] => APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
[4] => P O BOX # 1234 ADDRESS CITY st
[5] => APT DIGIT# ADDRESS CITY st
[6] => SPACE DIGIT ADDRESS CITY st
[7] => UNIT # ADDRESS CITY st
[8] => SP DIGIT ADDRESS CITY st
[9] => DIGITS-DIGITS ADDRESS CITY st
[10] => BX DIGIT ADDRESS CITY st
[11] => ADDRESS APT # CITY st
[12] => ADDRESS UNIT # CITY st
[13] => ADDRESS P O BOX DIGIT CITY st
[14] => P O B O X 123 CITY st
[15] => P O BOX 345 CITY st
[16] => ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY st
)
[address] => Array
(
[0] => ADDRESS 12345
[1] => ADDRESS
[2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
[3] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
[4] => ADDRESS
[5] => APT DIGIT#
[6] => ADDRESS
[7] => ADDRESS
[8] => ADDRESS
[9] => DIGITS-DIGITS ADDRESS
[10] => ADDRESS
[11] => APT #
[12] => UNIT #
[13] => DIGIT
[14] => 123
[15] => P O BOX 345
[16] => SPACE/SP/SPC/UNIT DIGIT
)
[zip] => Array
(
[0] => 12345
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
)
[city] => Array
(
[0] => CITY
[1] => CITY
[2] => CITY
[3] => CITY
[4] => CITY
[5] => ADDRESS CITY
[6] => CITY
[7] => CITY
[8] => CITY
[9] => CITY
[10] => CITY
[11] => CITY
[12] => CITY
[13] => CITY
[14] => CITY
[15] => CITY
[16] => CITY
)
[state] => Array
(
[0] => st
[1] => st
[2] => st
[3] => st
[4] => st
[5] => st
[6] => st
[7] => st
[8] => st
[9] => st
[10] => st
[11] => st
[12] => st
[13] => st
[14] => st
[15] => st
[16] => st
)
I'd suggest taking a look at question 11160192.
Answer 1 (score: 0)
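Denomales's answer may well be sufficient, but I wanted to expand my comment above into an answer, because I think some parts of it are relevant to your question.
Are these US addresses? You could try using an API or tool to extract the addresses in bulk. Here's an example of such a tool from another recent Stack Overflow answer, which had a small list of addresses to match:
For disclosure, I work at SmartyStreets and helped develop this. Although it was not designed specifically for spreadsheet or tabular address data, it is designed for non-uniform input such as free-form text. You could even pipe millions of rows into the service.
Perhaps this will be useful, since it also validates the addresses after finding them in the text. As you have discovered, addresses are pretty messy, and a dedicated tool can sometimes be the best way to handle them. I'm not saying this is the right answer for your case, but hopefully it is still informative.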