Question

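Hello, I'm new to Python and regex. I'm experimenting with both and trying to come up with a regex to extract data from user input, but I expect varied input, with typos and so on. So in the code below I randomly pick one of several kinds of strings as examples of how users might enter the data. I'm only interested in the number before or after USD. For example: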

import random
import re

ran = random.randint(1, 7)
print(str(ran))
if ran == 1:
    examplestring = "This item costs 20 USD contact 9999999"
elif ran == 2:
    examplestring = "This item costs USD 20"
elif ran == 3:
    examplestring = "This item costs 20 U.S.D"
elif ran == 4:
    examplestring = "This item costs 20 usd"
elif ran == 5:
    examplestring = "This item costs 20 Usd call to buy : 954545577"
elif ran == 6:
    examplestring = "This item costs 20USD"
elif ran == 7:
    examplestring = "This item costs usd20"

regex = re.compile(r'\busd|\bu.s.d\b|\bu.s.d.\b', re.I)
examplestring = regex.sub("USD", examplestring)
costs = re.findall(r'\d+.\bUSD\b|\bUSD\b.\d+|\d+USD\b|\bUSD\d+', examplestring)
cost = "".join(str(n) for n in costs[0])
cost = ''.join(x for x in cost if x.isdigit())
print(cost + " USD")

With these regexes I can get the detail I want, i.e. "20 USD". My question is whether I'm doing this the right way, and whether the code could be made better.

Answer 1

One way to do it:

regex = re.compile(r'\b(?=[0-9U])(?:[0-9]+\s*U\.?S\.?D|U\.?S\.?D\s*[0-9]+)\b', re.I)

result = [x.strip(' USD.usd') for x in regex.findall(yourstring)]

Pattern details:

\b         # word boundary
(?=[0-9U]) # only here to quickly discard word boundaries not followed
           # by a digit or the letter U, without having to test the two
           # branches of the following alternation. You can remove it if you want.

(?:
    [0-9]+\s*U\.?S\.?D # USD after
  |                    # OR
    U\.?S\.?D\s*[0-9]+ # USD before
)
\b

Note that whitespace and dots are optional in both branches.

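The "USD" part of the result is then removed with a simple strip. This is more convenient (and probably faster) than using lookarounds to try to exclude USD from the match result.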
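As a rough sketch of how these pieces fit together, the pattern can be run against the seven example strings from the question. The `extract_amounts` helper name and the `re.VERBOSE` layout are assumptions made for illustration; the pattern itself is the one given in this answer, and the code is written for Python 3.

```python
import re

# Same pattern as in the answer, laid out with re.VERBOSE so comments stay inline.
USD_PATTERN = re.compile(r"""
    \b
    (?=[0-9U])                # cheap pre-check: next char is a digit or U/u
    (?:
        [0-9]+\s*U\.?S\.?D    # amount first, USD after
      |
        U\.?S\.?D\s*[0-9]+    # USD first, amount after
    )
    \b
""", re.I | re.VERBOSE)


def extract_amounts(text):
    """Hypothetical helper: return the amounts found next to a USD marker."""
    # str.strip treats its argument as a set of characters to remove from
    # both ends, not as a literal prefix or suffix.
    return [m.strip(" USD.usd") for m in USD_PATTERN.findall(text)]


examples = [
    "This item costs 20 USD contact 9999999",
    "This item costs USD 20",
    "This item costs 20 U.S.D",
    "This item costs 20 usd",
    "This item costs 20 Usd call to buy : 954545577",
    "This item costs 20USD",
    "This item costs usd20",
]

for s in examples:
    print(extract_amounts(s))
```

Because the strip argument is a character set, it works here only because the amount consists of digits, none of which appear in the set.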

Answer 2

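I recommend Regex101 for more information and for an explanation of a given regex. In particular, you should look at groups (such as (\d+)), since I think that is exactly what you need to extract the value properly.

In my opinion, substituting and then searching the substituted string is a bit messy.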

import re
lines = """This item costs 20 USD
This item costs USD 20
This item costs 20 U.S.D
This item costs 20 usd
This item costs 20 Usd
This item costs 20USD
This item costs usd20"""

# as you can see there are two groups with the price
pattern = re.compile(r"u\.?s\.?d\s*(\d+)|(\d+)\s*u\.?s\.?d", re.I)
# one of the groups must have matched, so I take the non-empty one using the `or` operator
print(["{} USD".format(fst or sec) for fst, sec in pattern.findall(lines)])

Output:

['20 USD', '20 USD', '20 USD', '20 USD', '20 USD', '20 USD', '20 USD']

Answer 3

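As a very generic solution, [0-9]+ would extract just the amount and ignore the other text around it. It focuses on what you need to extract rather than on what can be ignored.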
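A minimal sketch of this approach in Python 3, using sample strings from the question. One thing it makes visible: when a string contains other digits, such as the contact numbers in the question's examples, `re.findall` returns those too. The `first_number` helper is hypothetical and assumes the amount is the first number in the text.

```python
import re


def first_number(text):
    """Hypothetical helper: return the first run of digits, or None."""
    match = re.search(r"[0-9]+", text)
    return match.group() if match else None


samples = [
    "This item costs 20 USD contact 9999999",
    "This item costs usd20",
]

for text in samples:
    # findall returns every digit run; first_number keeps only the first one
    print(re.findall(r"[0-9]+", text), first_number(text))
```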

Answer 4

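You can extract the value directly with a single regex that uses groups. For example, "(\d+) *u\.?s\.?d\.?|u\.?s\.?d\.? *(\d+)" can be used to search your string (with ignore-case specified). Then, if you get a match, your cost will be in group 1 or group 2, depending on which variant matched.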
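A short sketch of that idea in Python 3. The pattern is the one quoted above; the `find_cost` helper and the `COST_RE` name are assumptions added for illustration.

```python
import re

COST_RE = re.compile(r"(\d+) *u\.?s\.?d\.?|u\.?s\.?d\.? *(\d+)", re.IGNORECASE)


def find_cost(text):
    """Hypothetical helper: return the amount as a string, or None."""
    match = COST_RE.search(text)
    if match is None:
        return None
    # Only one branch of the alternation matches, so one group holds the
    # digits and the other is None.
    return match.group(1) or match.group(2)


print(find_cost("This item costs 20 U.S.D"))  # amount before USD: group 1
print(find_cost("This item costs usd20"))     # amount after USD: group 2
print(find_cost("No price here"))             # no match
```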

Efficient regex for multiple strings with characters and numbers

4 answers: