Question

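Using Python's re module, I am trying to get dollar values from statements like these:

"$305,000 - $349,950" should give a tuple like this (305000, 349950)
"Mid $2M's Buyers" -> (2000000)
"... Buyers Guide $1.29M+" -> (1290000)
"...$485,000 and $510,000" -> (485000, 510000)

The pattern below works for single values, but if there is a range (as in the first and last examples above), it only gives me the last figure (i.e. 349950 and 510000).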


_pattern = r"""(?x)
    ^
    .*
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?:\s(?:\-|\band\b|\bto\b)\s)?
    (?P<target2>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer2>[kKmM]?\s?[mM]?)
    )?
    .*?
    $
    """

When I try target2 = match.group("target2").strip(), target2 always shows up as None.

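I'm by no means a regexpert, but I can't really see what I'm doing wrong here. The multiplier group works, and to me the target2 group looks like the same pattern, i.e. an optional match at the end.

I hope I'm making sense...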

Answer 1

You could combine some regex logic with a function that converts the abbreviated numbers. Here is some sample Python code:

# -*- coding: utf-8 -*-
import locale
import re
from locale import atof
# atof() needs a locale that treats "," as the thousands separator
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

string = """"$305,000 - $349,950"
"Mid $2M's Buyers"
"... Buyers Guide $1.29M+"
"...$485,000 and $510,000"
"""

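def convert_number(number, unit):
    # scale the parsed number by its unit suffix (K = thousand, M = million)
    exp = 1  # no scaling for an unrecognised unit
    if unit == "K":
        exp = 10**3
    elif unit == "M":
        exp = 10**6
    return (atof(number) * exp)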

matches = []
rx = r"""
    \$(?P<value>\d+[\d,.]*)         # match a dollar sign 
                                    # followed by numbers, dots and commas
                                    # make the first digit necessary (+)
    (?P<unit>M|K)?                  # match M or K and save it to a group
    (                               # opening parenthesis
        \s(?:-|and)\s               # match a dash or "and" surrounded by whitespace
        \$(?P<value1>\d+[\d,.]*)    # the same pattern as above
        (?P<unit1>M|K)?
    )?                              # closing parenthesis,
                                    # make the whole subpattern optional (?)
"""
for match in re.finditer(rx, string, re.VERBOSE):
    if match.group('unit') is not None:
        value1 = convert_number(match.group('value'), match.group('unit'))
    else:
        value1 = atof(match.group('value'))
    m = (value1)
    if match.group('value1') is not None:
        if match.group('unit1') is not None:
            value2 = convert_number(match.group('value1'), match.group('unit1'))
        else:
            value2 = atof(match.group('value1'))
        m = (value1, value2)
    matches.append(m)

print(matches)
# [(305000.0, 349950.0), 2000000.0, 1290000.0, (485000.0, 510000.0)]

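The code uses quite a bit of logic: it first imports the locale module for the atof() function, defines a function convert_number(), and searches for ranges with a regular expression that is explained in the code comments. You could obviously add other currency symbols such as € or £, but they weren't in the original examples.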
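For comparison, here is a minimal Python 3 sketch that avoids the locale dependency by stripping the commas and scaling with Decimal. It is not the code from this answer: the names AMOUNT_RX, MULTIPLIERS and dollar_values() are made up for illustration, and it simply collects every amount found in the text instead of matching ranges explicitly.

```python
import re
from decimal import Decimal

# Hypothetical helper names; K = thousand, M = million.
MULTIPLIERS = {"K": 10**3, "M": 10**6}

# A dollar sign, digits with optional thousands groups and decimals,
# then an optional K/M suffix directly after the number.
AMOUNT_RX = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)([KkMm])?")


def dollar_values(text):
    """Return every dollar amount found in text as a tuple of ints."""
    values = []
    for number, unit in AMOUNT_RX.findall(text):
        # Decimal keeps "1.29" exact, so 1.29M becomes exactly 1290000.
        amount = Decimal(number.replace(",", ""))
        values.append(int(amount * MULTIPLIERS.get(unit.upper(), 1)))
    return tuple(values)


print(dollar_values("$305,000 - $349,950"))       # (305000, 349950)
print(dollar_values("... Buyers Guide $1.29M+"))  # (1290000,)
```

A single value comes back as a one-element tuple, so callers can treat ranges and single prices the same way.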

Answer 2

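+1 for using verbose mode for the regular expression.

The .* at the start of the pattern is greedy, so it first tries to match the whole line. It then backtracks until target1 can match. Everything else in the pattern is optional, so matching target1 against the last amount on the line already counts as a successful match. You could try making the first .* non-greedy by adding a "?" like this: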

_pattern = r"""(?x)
    ^
    .*?                   # <-- add the ?
    (?P<target1>
    ... snip ...
    """
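To see the difference, here is a small sketch using a simplified stand-in for the pattern (no multiplier groups; AMOUNT, TEMPLATE and targets() are names invented for this illustration, not part of the question's code):

```python
import re

# Simplified amount: currency symbol, 1-3 digits, optional thousands groups.
AMOUNT = r"[€$£]\d{1,3}(?:[,.]\d{3})*"
TEMPLATE = (
    r"^{prefix}(?P<target1>{amt})"
    r"(?:\s(?:-|and|to)\s)?"
    r"(?P<target2>{amt})?.*?$"
)

greedy = re.compile(TEMPLATE.format(prefix=".*", amt=AMOUNT))
lazy = re.compile(TEMPLATE.format(prefix=".*?", amt=AMOUNT))


def targets(regex, line):
    """Return (target1, target2) for line, or None if nothing matches."""
    m = regex.match(line)
    return (m.group("target1"), m.group("target2")) if m else None


line = "$305,000 - $349,950"
print(targets(greedy, line))  # ('$349,950', None)
print(targets(lazy, line))    # ('$305,000', '$349,950')
```

With the greedy prefix, target1 lands on the last amount and target2 stays None; with the lazy prefix, target1 takes the first amount and target2 gets a chance to match the second.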

Or could you do it incrementally?

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?P<more>\s(?:\-|\band\b|\bto\b)\s)?
    """

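regex = re.compile(_pattern)
match = regex.search(line)
target1, more = match.group("target1", "more")
if more:
    # re.search() has no start argument; a compiled pattern's search() takes pos
    target2 = regex.search(line, match.end())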

Edit: one more idea: try re.findall():

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    """

targets = re.findall(_pattern, line)

Regex - match numbers and optional range
