Question

I am using Python regex to extract certain values from a given string. Here is my string:

mystring.txt

sometext
somemore    text here

some  other text

              course: course1
Id              Name                marks
____________________________________________________
1               student1            65
2               student2            75
3               MyName              69
4               student4            43

              course: course2
Id              Name                marks
____________________________________________________
1               student1            84
2               student2            73
8               student7            99
4               student4            32

              course: course4
Id              Name                marks
____________________________________________________
1               student1            97
3               MyName              60
8               student6            82

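I need to extract the course names and the corresponding marks for a particular student. For example, I need to find the courses and marks for MyName in the string above.

I tried: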

re.findall(".*?course: (\w+).*?MyName\s+(\d+).*?",buff,re.DOTALL)

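But this only works when MyName appears under every course. It fails if MyName is missing from some courses, as in my example string.

Here the output is: [('course1', '69'), ('course2', '60')]

But what I actually want is: [('course1', '69'), ('course4', '60')]

What is the correct regex for this?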

#!/usr/bin/python
import re

# Read the whole file into one string so the pattern can span lines.
with open("mystring.txt") as buffer_fp:
    buff = buffer_fp.read()
print(re.findall(r".*?course: (\w+).*?MyName\s+(\d+).*?", buff, re.DOTALL))

Answer 1

.*?course: (\w+)(?:(?!\bcourse\b).)*MyName\s+(\d+).*?

                    ^^^^^^^^^^^^

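You can try this. See the demo. It uses a lookahead-based quantifier: (?:(?!\bcourse\b).)* advances one character at a time only while the word course does not start at the current position, so the match cannot run past the next course header before it reaches MyName.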

https://regex101.com/r/pG1kU1/26
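
For readers who want to try this outside regex101, here is a minimal sketch that runs the pattern from this answer against the sample from the question. The names SAMPLE and PATTERN are made up for illustration, and the input is inlined rather than read from mystring.txt; re.DOTALL is still needed so that the dot can cross line breaks.

```python
import re

# Sample input copied from the question (normally read from mystring.txt).
SAMPLE = """\
sometext
somemore    text here

some  other text

              course: course1
Id              Name                marks
____________________________________________________
1               student1            65
2               student2            75
3               MyName              69
4               student4            43

              course: course2
Id              Name                marks
____________________________________________________
1               student1            84
2               student2            73
8               student7            99
4               student4            32

              course: course4
Id              Name                marks
____________________________________________________
1               student1            97
3               MyName              60
8               student6            82
"""

# The pattern from the answer. The middle group consumes a character only
# when the whole word "course" does not start at that position, so a match
# that begins in one course block cannot reach MyName in a later block.
PATTERN = re.compile(
    r".*?course: (\w+)(?:(?!\bcourse\b).)*MyName\s+(\d+).*?",
    re.DOTALL,
)

result = PATTERN.findall(SAMPLE)
print(result)  # [('course1', '69'), ('course4', '60')]
```

Note that course2 is skipped entirely: no MyName row can be reached from its header without crossing the course4 header, so the engine moves on to the next header.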

Answer 2

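I doubt this can be done in a single regex. They are not omnipotent.

Even if you find a way, don't do it. Your non-working regex is already close to unreadable; a working one would probably be even more so. You can most likely do this in just a few lines of meaningful code. A pseudocode solution: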

for each line in buff:
    if it is a course line:
        set the course variable
    if it is a MyName line:
        add (course, marks) to the list of matches

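Note that this may (and probably should) involve a regex in each if block. The point is not to choose between a hammer and a screwdriver to the exclusion of the other, but to use each for what it does best.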
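
The pseudocode above could be fleshed out roughly as follows. This is only a sketch under some assumptions: the helper name marks_for, the two patterns, and the short inline sample are invented for illustration, and each table row is assumed to have the form "id name marks" separated by whitespace.

```python
import re

# Hypothetical patterns, one per kind of line we care about.
COURSE_LINE = re.compile(r"^\s*course:\s*(\w+)")
ROW_LINE = re.compile(r"^\s*\d+\s+(\w+)\s+(\d+)\s*$")


def marks_for(text, student):
    """Return (course, marks) pairs for one student, scanning line by line."""
    matches = []
    course = None  # most recent course header seen so far
    for line in text.splitlines():
        header = COURSE_LINE.match(line)
        if header:
            course = header.group(1)
            continue
        row = ROW_LINE.match(line)
        if row and course is not None and row.group(1) == student:
            matches.append((course, row.group(2)))
    return matches


# Shortened version of the sample from the question.
text = """\
              course: course1
Id              Name                marks
1               student1            65
3               MyName              69

              course: course2
Id              Name                marks
1               student1            84

              course: course4
Id              Name                marks
3               MyName              60
"""

print(marks_for(text, "MyName"))  # [('course1', '69'), ('course4', '60')]
```

Because the current course is tracked in an ordinary variable, a course without the student simply contributes nothing, and each regex stays short enough to read at a glance.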
