Question

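I'm trying to filter street names and extract the part I want. The names come in several formats. Here are some examples, each with what I want from it.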

Car Cycle 5 B Ap 1233       < what I have
Car Cycle 5 B               < what I want

Potato street 13 1 AB       < what I have
Potato street 13            < what I want

Chrome Safari 41 Ap 765     < what I have
Chrome Safari 41            < what I want

Highstreet 53 Ap 2632/BH    < what I have
Highstreet 53               < what I want

Something street 91/Daniel  < what I have
Something street 91         < what I want

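Generally, what I want is the street name (1-4 words), followed by the street number (if there is one), then the street letter (a single letter, if there is one). I just can't get it to work.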

Here is my code (I know, it's bad):

import re

def address_regex(address):
    regex1 = re.compile(r"(\w+ ){1,4}(\d{1,4} ){1}(\w{1} )")
    regex2 = re.compile(r"(\w+ ){1,4}(\d{1,4} ){1}")
    regex3 = re.compile(r"(\w+ ){1,4}(\d){1,4}")
    regex4 = re.compile(r"(\w+ ){1,4}(\w+)")

    s1 = regex1.search(address)
    s2 = regex2.search(address)
    s3 = regex3.search(address)
    s4 = regex4.search(address)

    regex_address = ""

    if s1 != None:
        regex_address = s1.group()
    elif s2 != None:
        regex_address = s2.group()
    elif s3 != None:
        regex_address = s3.group()
    elif s4 != None:
        regex_address = s4.group()    
    else:
        regex_address = address

    return regex_address

I'm using Python 3.4.

Answer 1

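I'm going to go out on a limb here and assume that in your last example you really do want to capture the number 91, because it makes no sense not to.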

Here is a solution that captures all of your examples (including the 91 in the last one):

^([\p{L} ]+ \d{1,4}(?: ?[A-Za-z])?\b)

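^  at the start of the string
[\p{L} ]+  a character class of a space or any Unicode character in the "letter" category, 1 to infinite times
\d{1,4}  a digit, 1 to 4 times
(?: ?[A-Za-z])?  a non-capturing group of an optional space and a single letter, 0 to 1 times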

Capture group 1 is the whole address. I don't entirely understand the logic behind your grouping, but you can group it however you like.

See demo
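
For Python specifically, note that the built-in re module does not accept \p{L} (the third-party regex module does). The sketch below is one possible adaptation, not part of the original answer: it assumes that [^\W\d_] is an acceptable stand-in for \p{L}, and the names ADDRESS_RE and extract_address are made up for illustration.

```python
import re

# Assumption: [^\W\d_] (a word character that is neither a digit nor "_")
# stands in for \p{L}, which the built-in re module rejects.
ADDRESS_RE = re.compile(r"^((?:[^\W\d_]| )+ \d{1,4}(?: ?[A-Za-z])?\b)")


def extract_address(address):
    """Return the leading street part, or the input unchanged if nothing matches."""
    match = ADDRESS_RE.match(address)
    return match.group(1) if match else address


samples = [
    "Car Cycle 5 B Ap 1233",
    "Potato street 13 1 AB",
    "Chrome Safari 41 Ap 765",
    "Highstreet 53 Ap 2632/BH",
    "Something street 91/Daniel",
]

for sample in samples:
    print(extract_address(sample))
```

Falling back to the unchanged input when nothing matches mirrors the else branch in the question's code; returning None instead would work just as well if the caller prefers to detect failures.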

Answer 2

This works for the 5 samples you provided:

^([a-z]+\s+)*(\d*(?=\s))?(\s+[a-z])*\b

Turn multiline mode and case-insensitive matching on. If your regex flavor supports inline flags, that is (?im).
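
In Python, those two flags correspond to re.IGNORECASE and re.MULTILINE. The following is only a sketch of how the pattern might be applied to a multi-line string; TEXT and street_parts are hypothetical names. As far as I can tell, the lookahead (?=\s) fails on "91/" in the last sample, so that line comes back as "Something street" without the number, and the trailing space left by the match is stripped.

```python
import re

# The (?im) flags from the answer, spelled out as re constants.
PATTERN = re.compile(
    r"^([a-z]+\s+)*(\d*(?=\s))?(\s+[a-z])*\b",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical input: the five samples joined into one multi-line string.
TEXT = "\n".join([
    "Car Cycle 5 B Ap 1233",
    "Potato street 13 1 AB",
    "Chrome Safari 41 Ap 765",
    "Highstreet 53 Ap 2632/BH",
    "Something street 91/Daniel",
])


def street_parts(text):
    """Collect the match found at the start of each line, minus trailing spaces."""
    return [m.group(0).rstrip() for m in PATTERN.finditer(text)]


for part in street_parts(TEXT):
    print(part)
```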

Answer 3

Maybe you'd prefer a more readable Python version (without regular expressions):

import string

names = [
    "Car Cycle 5 B Ap 1233",
    "Potato street 13 1 AB",
    "Chrome Safari 41 Ap 765",
    "Highstreet 53 Ap 2632/BH",
    "Something street 91/Daniel",
    ]

for name in names:
    result = []
    words = name.split()
    while any(words) and all(c in string.ascii_letters for c in words[0]):
        result += [words[0]]
        words = words[1:]
    if any(words) and all(c in string.digits for c in words[0]):
        result += [words[0]]
        words = words[1:]
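    # Require exactly one character: "AB" in string.ascii_uppercase is also True (substring test).
    if any(words) and len(words[0]) == 1 and words[0] in string.ascii_uppercase: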
        result += [words[0]]
        words = words[1:]
    print(" ".join(result))

Output:

Car Cycle 5 B
Potato street 13
Chrome Safari 41
Highstreet 53
Something street

Python regex won't work the way I want it to
