Question

I am trying to extract some values from text, skipping over parts of it based on the text that was matched. For example:

      Input Text - 
      ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.
      OR
      ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.
      OR
      ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%.

      Output Text -
      Amount 400.00
      GST 36479
      GST 20%

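The point is that the input text can be in any format, but the output text should always be the same. What stays constant is that the GST number is a non-decimal number (an integer), the GST percentage is a number followed by a "%" sign, and the amount is in decimal form.

I tried, but I could not skip the non-digit characters after GST. Please help.

What I have tried: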

              pattern = re.compile(r"\b(?<=GST).\D(\d+)")

Answer 1

You can use

\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)

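See the regex demo. Details:

\bAmount\s* - the whole word Amount and zero or more whitespace characters
(?P<amount>\d+(?:\.\d+)?) - group "amount": one or more digits, then an optional sequence of . and one or more digits
.*? - any zero or more characters other than line break characters, as few as possible
\bGST - the word GST
\D* - zero or more non-digit characters
(?P<gst_id>\d+(?:\.\d+)?) - group "gst_id": one or more digits, then an optional sequence of . and one or more digits
.*? - any zero or more characters other than line break characters, as few as possible
\bGST\D* - the word GST, then zero or more non-digit characters
(?P<gst_prcnt>\d+(?:\.\d+)?%) - group "gst_prcnt": one or more digits, then an optional sequence of . and one or more digits, and then a % character.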
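As a side note, the same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch of an alternative layout, not a different regex; the names PATTERN, sample and result are made up for illustration.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so every piece is commented.
# In verbose mode, unescaped whitespace and "#" comments inside the pattern
# are ignored; this pattern needs no literal space or "#", so nothing is escaped.
PATTERN = re.compile(r"""
    \bAmount\s*                      # whole word Amount, then optional whitespace
    (?P<amount>\d+(?:\.\d+)?)        # amount: digits with an optional fractional part
    .*?                              # as few characters as possible (no line breaks)
    \bGST\D*                         # first GST, then any run of non-digit characters
    (?P<gst_id>\d+(?:\.\d+)?)        # GST number
    .*?                              # again, as few characters as possible
    \bGST\D*                         # second GST, then non-digit characters
    (?P<gst_prcnt>\d+(?:\.\d+)?%)    # percentage: digits, optional fraction, then %
""", re.VERBOSE)

sample = "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%."
match = PATTERN.search(sample)
result = match.groupdict() if match else None
print(result)
```

The \D* after each GST is what skips "#", " Reg No. " or "% is " regardless of which wording the input uses.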

See the Python demo:

import re
pattern = r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"

texts = ["ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.",
"ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.",
"ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%."]

for text in texts:
    m = re.search(pattern, text)
    if m:
        print(m.groupdict())

Output:

{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
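If the three-line "Output Text" layout from the question is wanted rather than a dictionary, the named groups can be formatted directly. The following is a minimal sketch; summarize is a hypothetical helper name, and returning None for text that does not match is an assumption about the desired behavior.

```python
import re

PATTERN = re.compile(
    r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?"
    r"\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?"
    r"\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"
)

def summarize(text):
    """Hypothetical helper: build the three output lines, or return None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # m['name'] reads a named group (supported since Python 3.6)
    return "\n".join([
        f"Amount {m['amount']}",
        f"GST {m['gst_id']}",
        f"GST {m['gst_prcnt']}",
    ])

summary = summarize(
    "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%."
)
print(summary)
```

Checking for None before formatting keeps the helper from raising on lines that lack one of the three values.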
