Question

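I am trying to detect prices written either in words or in digits. Is there a way to do this with a regular expression, or would some other approach be better?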

For digits, the regex I came up with is ^\d{0,8}(\.\d{1,4})?$, which I found here.
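As a quick check of what that numeric pattern accepts, here is a minimal Python sketch. The helper name looks_like_price is hypothetical and used only for illustration. Note that both the integer part and the decimal part are optional, so the pattern also accepts an empty string, which is probably not wanted for prices.

```python
import re

# The numeric pattern from the question: up to 8 digits, then an optional
# decimal part with 1 to 4 digits.
PRICE = re.compile(r"^\d{0,8}(\.\d{1,4})?$")

def looks_like_price(token):
    """Return True if the whole token matches the numeric price pattern."""
    return bool(PRICE.match(token))

for token in ["415.65", "220.00", "100", "Rs", "1.23456", ""]:
    print(repr(token), looks_like_price(token))
```

Changing \d{0,8} to \d{1,8} would be one way to reject the empty string, at the cost of also rejecting inputs such as .5.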

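Is it possible to use a regex to detect a price written in words, for example 515 written out? I am looking at grocery invoices; a sample is below, and I want to extract the price of each product and the total price. I would also like to know whether a regex can extract the price written in words.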

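Big Grocery

Item ID AMNIL PARA 101103

Bill No: 100000000070

Date: 16 May 2012 1:07 AM

No. of items: 4 Amount (Rs): 415.65,

Qty Unit ItemName

2 Nos Amul Ice Cream - Vanilla - 1 Ltr pack

2 Strips) Paracetamol tablet 500mg

1 Nos Close Up Toothpaste - 200 gm

1 Nos Gillette Mach3 Razor blades

Total

Price (Rs)

220.00

25.00

70.00

100.00

415.00

Total Price in Words : Four hundred fifteen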

Answer 1

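You can use this regex, as seen here (compatible with PCRE; in Python it needs the third-party regex module, because the standard re module does not support (?(DEFINE)...) blocks or (?&name) subroutine calls):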

(?x)           # free-spacing mode
(?(DEFINE)
  # Within this DEFINE block, we'll define many subroutines
  # They build on each other like lego until we can define
  # a "big number"

  (?<one_to_9>  
  # The basic regex:
  # one|two|three|four|five|six|seven|eight|nine
  # We'll use an optimized version:
  # Option 1: four|eight|(?:fiv|(?:ni|o)n)e|t(?:wo|hree)|
  #                                          s(?:ix|even)
  # Option 2:
  (?:f(?:ive|our)|s(?:even|ix)|t(?:hree|wo)|(?:ni|o)ne|eight)
  ) # end one_to_9 definition

  (?<ten_to_19>  
  # The basic regex:
  # ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|
  #                                              eighteen|nineteen
  # We'll use an optimized version:
  # Option 1: twelve|(?:(?:elev|t)e|(?:fif|eigh|nine|(?:thi|fou)r|
  #                                             s(?:ix|even))tee)n
  # Option 2:
  (?:(?:(?:s(?:even|ix)|f(?:our|if)|nine)te|e(?:ighte|lev))en|
                                          t(?:(?:hirte)?en|welve)) 
  ) # end ten_to_19 definition

  (?<two_digit_prefix>
  # The basic regex:
  # twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety
  # We'll use an optimized version:
  # Option 1: (?:fif|six|eigh|nine|(?:tw|sev)en|(?:thi|fo)r)ty
  # Option 2:
  (?:s(?:even|ix)|t(?:hir|wen)|f(?:if|or)|eigh|nine)ty
  ) # end two_digit_prefix definition

  (?<one_to_99>
  (?&two_digit_prefix)(?:[- ](?&one_to_9))?|(?&ten_to_19)|
                                              (?&one_to_9)
  ) # end one_to_99 definition

  (?<one_to_999>
  (?&one_to_9)[ ]hundred(?:[ ](?:and[ ])?(?&one_to_99))?|
                                            (?&one_to_99)
  ) # end one_to_999 definition

  (?<one_to_999_999>
  (?&one_to_999)[ ]thousand(?:[ ](?&one_to_999))?|
                                    (?&one_to_999)
  ) # end one_to_999_999 definition

  (?<one_to_999_999_999>
  (?&one_to_999)[ ]million(?:[ ](?&one_to_999_999))?|
                                   (?&one_to_999_999)
  ) # end one_to_999_999_999 definition

  (?<one_to_999_999_999_999>
  (?&one_to_999)[ ]billion(?:[ ](?&one_to_999_999_999))?|
                                   (?&one_to_999_999_999)
  ) # end one_to_999_999_999_999 definition

  (?<one_to_999_999_999_999_999>
  (?&one_to_999)[ ]trillion(?:[ ](?&one_to_999_999_999_999))?|
                                    (?&one_to_999_999_999_999)
  ) # end one_to_999_999_999_999_999 definition

  (?<bignumber>
  zero|(?&one_to_999_999_999_999_999)
  ) # end bignumber definition

  (?<zero_to_9>
  (?&one_to_9)|zero
  ) # end zero to 9 definition

  (?<decimals>
  point(?:[ ](?&zero_to_9))+
  ) # end decimals definition

) # End DEFINE


####### The Regex Matching Starts Here ########
(?&bignumber)(?:[ ](?&decimals))?

### Other examples of groups we could match ###
#(?&bignumber)
# (?&one_to_99)
# (?&one_to_999)
# (?&one_to_999_999)
# (?&one_to_999_999_999)
# (?&one_to_999_999_999_999)
# (?&one_to_999_999_999_999_999)

But that may be overkill :)
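If you only have Python's standard re module, which has no subroutine calls, a rough equivalent is to compose the pattern from ordinary strings. The sketch below is a simplified illustration, not the pattern above: it uses the plain alternations instead of the optimized ones, stops at 999,999, skips the decimals part, and all constant and function names are made up for this example.

```python
import re

# Building blocks that mirror the named subroutines of the PCRE pattern.
ONE_TO_9 = r"(?:one|two|three|four|five|six|seven|eight|nine)"
TEN_TO_19 = (r"(?:ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|"
             r"seventeen|eighteen|nineteen)")
TENS = r"(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)"

# Larger ranges are composed from the smaller ones by string interpolation.
ONE_TO_99 = rf"(?:{TENS}(?:[- ]{ONE_TO_9})?|{TEN_TO_19}|{ONE_TO_9})"
ONE_TO_999 = rf"(?:{ONE_TO_9} hundred(?: (?:and )?{ONE_TO_99})?|{ONE_TO_99})"
ONE_TO_999_999 = rf"(?:{ONE_TO_999} thousand(?: {ONE_TO_999})?|{ONE_TO_999})"

# Word boundaries keep the pattern from matching inside longer words.
NUMBER_IN_WORDS = re.compile(rf"\b(?:zero|{ONE_TO_999_999})\b", re.IGNORECASE)

def find_numbers_in_words(text):
    """Return every spelled-out number (zero to 999,999) found in text."""
    return [m.group(0) for m in NUMBER_IN_WORDS.finditer(text)]

print(find_numbers_in_words("Total Price in Words : Four hundred fifteen"))
```

The composition order matters: each range tries its longest form first (for example the "hundred" form before a bare one_to_99), so a phrase is matched as a whole rather than word by word.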

Given the structure of your data, you could instead just capture whatever comes after Total Price in Words :

So something like this may work for you:

^\h*Total Price in Words\s*:\s*(.*)

You will find the data in group 1 (usually $1 or \1).

Demo
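In Python, the same idea could look roughly like the sketch below. It assumes the invoice is available as plain text with one field per line (the INVOICE sample is abridged from the question), and the function names are hypothetical. Python's re module does not recognize \h, so [ \t]* is used in its place. The second pattern, for lines that hold only an amount, is an addition for the per-item prices and is not part of the regex above.

```python
import re

# Assumed input: the invoice from the question as plain text (abridged).
INVOICE = """\
Bill No: 100000000070
No. of items: 4 Amount (Rs): 415.65,
Price (Rs)
220.00
25.00
70.00
100.00
415.00
Total Price in Words : Four hundred fifteen
"""

# [ \t]* stands in for PCRE's \h (horizontal whitespace).
TOTAL_IN_WORDS = re.compile(r"^[ \t]*Total Price in Words\s*:\s*(.*)", re.MULTILINE)

# A line that consists only of an amount with two decimals, e.g. 220.00.
PRICE_LINE = re.compile(r"^[ \t]*(\d+\.\d{2})[ \t]*$", re.MULTILINE)

def extract_total_in_words(text):
    """Return the text after 'Total Price in Words :', or None if absent."""
    match = TOTAL_IN_WORDS.search(text)
    return match.group(1).strip() if match else None

def extract_price_lines(text):
    """Return all amounts that stand alone on a line, in document order."""
    return PRICE_LINE.findall(text)

print(extract_total_in_words(INVOICE))
print(extract_price_lines(INVOICE))
```

With this layout the last stand-alone amount is the total and the ones before it are the item prices, but that relies on the invoice always listing prices in that order.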

Answer 2

I suggest using https://regex101.com/codegen?language=python to generate the code for your answer.

# coding=utf8
# The line above declares the source encoding; it is only needed for Python 2.x.

import re

# \d+ matches runs of one or more digits. The generated pattern [\d]* also
# matches the empty string at every position, which floods the output.
regex = r"\d+"

test_str = "My text with numbers : 324 and 2342 1 3. G00d Luck!"

matches = re.finditer(regex, test_str)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Report capture groups, if the pattern defines any (this one has none).
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
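For prices specifically, a digits-only pattern would split 415.65 into 415 and 65. A small variation, shown here as a sketch on one line from the question's invoice, keeps the decimal part attached; the name AMOUNT is made up for this example.

```python
import re

# Digits, optionally followed by a decimal part, so 415.65 stays one token.
AMOUNT = re.compile(r"\d+(?:\.\d+)?")

line = "No. of items: 4 Amount (Rs): 415.65,"
amounts = [m.group() for m in AMOUNT.finditer(line)]
print(amounts)  # ['4', '415.65']
```

This still picks up the item count (4) alongside the amount, so in practice you would anchor on a label such as Amount (Rs) to tell the two apart.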

Detecting prices written in words
