Ignoring data under a specific part of a string using Python

Time: 2019-05-06 19:48:44

Tags: regex python-3.x string

I have a string like the one shown below

line="record of Students Name Codes:  AC1.123  XYZ12.67  the student is math major first hisory: XY12.34 good performer second history M12.78 N23.76 faculty Miss Cooper"

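I want to extract some codes from this line, and I am using the program below. I want to ignore the codes that appear in the history sections.

How can I ignore the codes in the parts of the string that contain history?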

import re
regular_expression = re.compile(r'\b[A-Z]+\d{1,2}\.*\d{1,2}\w{0,2}\b', re.I)
matches = regular_expression.findall(line)
for match in matches:
    print(match)

Expected output:

AC1.123
XYZ12.67

Current output:

AC1.123
XYZ12.67
XY12.34
M12.78
N23.76

2 answers:

Answer 0: (score: 1)

You can match all the values in the history sections that you don't want, and then capture what you do want in a group:

\bhistory:? [A-Z]+\d+\.\d+(?: [A-Z]+\d+\.\d+)*|([A-Z]+\d+\.\d+(?: [A-Z]+\d+)*)

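Explanation

  • \bhistory:? Word boundary, then match history, an optional colon, and a space
  • [A-Z]+\d+\.\d+ Match A-Z 1+ times, 1+ digits, a literal dot, and 1+ digits
  • (?: Non-capturing group
    •  [A-Z]+\d+\.\d+ Match a space followed by the preceding pattern
  • )* Close the non-capturing group and repeat it 0+ times
  • | Or
  • ( Capturing group
    • [A-Z]+\d+\.\d+ Match the same pattern as the first one
    • (?: [A-Z]+\d+)* Repeat 0+ times a space followed by A-Z 1+ times and 1+ digits (no dot or decimal part here)
  • ) Close the capturing group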
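
The breakdown above can also be written out as a commented pattern. The following is a minimal sketch, assuming Python 3.6+ and the corrected spelling of history in the input; the names CODE, PATTERN and extract_codes are made up for illustration and do not come from the answer. It uses re.VERBOSE, so literal spaces are written as [ ].

```python
import re

# CODE, PATTERN and extract_codes are illustrative names, not from the answer.
CODE = r"[A-Z]+\d+\.\d+"  # letters, digits, a literal dot, digits

PATTERN = re.compile(
    rf"""
    \bhistory:?[ ]{CODE}         # "history", optional colon, a space, one code
    (?:[ ]{CODE})*               # further codes, each preceded by a space
    |                            # or
    ( {CODE} (?:[ ][A-Z]+\d+)* ) # group 1: a code outside any history section
    """,
    re.I | re.X,
)


def extract_codes(line):
    # group(1) is None when the "history" alternative matched,
    # so only matches that filled group 1 are kept.
    return [m.group(1) for m in PATTERN.finditer(line) if m.group(1)]


line = "record of Students Name Codes:  AC1.123  XYZ12.67  the student is math major first history: XY12.34 good performer second history M12.78 N23.76 faculty Miss Cooper"
print(extract_codes(line))  # ['AC1.123', 'XYZ12.67']
```

Because the history alternative comes first, it consumes the codes that follow the word history before the capturing alternative gets a chance to see them.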


I assume hisory is a typo and should be history.

For example:

import re
line = "record of Students Name Codes:  AC1.123  XYZ12.67  the student is math major first history: XY12.34 good performer second history M12.78 N23.76 faculty Miss Cooper"
regular_expression = re.compile(r'\bhistory:? [A-Z]+[0-9]+\.[0-9]+(?: [A-Z]+[0-9]+\.[0-9]+)*|([A-Z]+[0-9]+\.[0-9]+(?: [A-Z]+[0-9]+)*)', re.I)
matches = regular_expression.findall(line)
print(list(filter(None, matches)))

Result

['AC1.123', 'XYZ12.67']
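
To see why the empty values have to be filtered out, it may help to look at what findall returns before filtering. This is a small sketch using the same pattern and the corrected input string; the variable names raw and codes are chosen here for illustration.

```python
import re

line = "record of Students Name Codes:  AC1.123  XYZ12.67  the student is math major first history: XY12.34 good performer second history M12.78 N23.76 faculty Miss Cooper"
pattern = re.compile(
    r'\bhistory:? [A-Z]+[0-9]+\.[0-9]+(?: [A-Z]+[0-9]+\.[0-9]+)*'
    r'|([A-Z]+[0-9]+\.[0-9]+(?: [A-Z]+[0-9]+)*)',
    re.I,
)

# With exactly one capturing group, findall returns that group's text for
# every match. Matches taken by the "history" alternative leave the group
# unset, which shows up as an empty string.
raw = pattern.findall(line)
print(raw)    # ['AC1.123', 'XYZ12.67', '', '']

# Dropping the empty strings leaves only the wanted codes.
codes = [code for code in raw if code]
print(codes)  # ['AC1.123', 'XYZ12.67']
```

The two empty strings correspond to the two history sections in the line.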

Answer 1: (score: 0)

I'm not sure what rules you want, but this may help you design an expression:

(AC|XYZ)([0-9]+\.[0-9]+)

Sample test

# -*- coding: UTF-8 -*-
import re

string = "record of Students Name Codes:  AC1.123  XYZ12.67  the student is math major first hisory: XY12.34 good performer second history M12.78 N23.76 faculty Miss Cooper"
# The dot is escaped so that it only matches a literal "."
expression = r'((AC|XYZ)([0-9]+\.[0-9]+))'
# re.search returns only the first match in the string
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match")
else:
    print('Sorry! No matches! Something is not right! Call 911')
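
Since re.search stops at the first match, a variant with re.findall may be closer to the expected output in the question. This is only a sketch, and it assumes the wanted codes always start with AC or XYZ, which the question does not confirm. The group is non-capturing so that findall returns whole matches, and the dot is escaped so it matches only a literal period.

```python
import re

string = "record of Students Name Codes:  AC1.123  XYZ12.67  the student is math major first hisory: XY12.34 good performer second history M12.78 N23.76 faculty Miss Cooper"

# Assumption: only codes prefixed with AC or XYZ are wanted.
codes = re.findall(r'(?:AC|XYZ)[0-9]+\.[0-9]+', string)
print(codes)  # ['AC1.123', 'XYZ12.67']
```

This works regardless of the hisory typo, but it will miss any wanted code with a different prefix.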