Parsing a transcript with a regular expression

Time: 2017-10-01 14:19:14

Tags: python regex parsing

My text is in a format similar to this sample:

  

PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.

I wrote a regular expression to parse the transcript into dialogue:

[A-Z]+([:]|[ ]{1}[[][A-Z]*[]])

I am trying to capture all the speakers, so that the regex matches

"PAUL:", 
"LEONARD [some context]:" 

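As you can see, I cannot capture all of the speakers. For example, this one is missed:

EVIL NINJA [on the roof]:

How can I capture the above? Is a regex even the right approach?

Edit: All speaker names are in uppercase letters and end with a colon. This is the format of all the transcripts I work with.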

3 answers:

Answer 0: (score: 3)

The problem with your regex is that it does not allow any spaces, so it matches neither "EVIL NINJA" nor "on the roof".
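To see the failure concretely, here is a minimal sketch that runs the question's pattern against the four speaker labels. The pattern is rewritten with escaped brackets (`\[` and `\]` instead of `[[]` and `[]]`), which I am assuming is equivalent for this purpose; the names `original`, `labels` and `results` are only illustrative.

```python
import re

# The question's pattern, rewritten with escaped brackets (assumed equivalent).
original = re.compile(r'[A-Z]+(:| \[[A-Z]*\])')

labels = [
    "PAUL:",
    "LEONARD:",
    "EVIL NINJA [on the roof]:",
    "PAUL [SCREAMING]:",
]

results = {}
for label in labels:
    m = original.search(label)
    # Store the matched text, or None when the pattern finds nothing.
    results[label] = m.group(0) if m else None
    print(repr(label), '->', repr(results[label]))
```

The label with a space in the name and lowercase words in the brackets is not matched at all. The bracketed alternative also stops before the colon, so "PAUL [SCREAMING]:" is matched only up to the closing bracket.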

But yes, a regex is definitely the right approach. You could try this:

([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:

Usage:

import re

# text holds the transcript shown in the question
regex = r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:'

for match in re.finditer(regex, text):
    print('person:', match.group(1))
    print('context:', match.group(2))
    print()

Output:

person: PAUL
context: None

person: LEONARD
context: None

person: EVIL NINJA
context: on the roof

person: PAUL
context: SCREAMING
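Since the goal is to split the transcript into dialogue, the same regex can also mark where each speech starts and ends. The following is only a sketch under that assumption: `parse_transcript`, `LABEL` and the shortened `sample` text are names and data I made up for illustration, not part of the answer.

```python
import re

# Same pattern as above: speaker name, optional bracketed context, colon.
LABEL = re.compile(r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:')

def parse_transcript(text):
    """Return a list of (speaker, context, speech) tuples."""
    matches = list(LABEL.finditer(text))
    turns = []
    for i, m in enumerate(matches):
        # A speech runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        speech = text[m.end():end].strip()
        turns.append((m.group(1), m.group(2), speech))
    return turns

sample = (
    "PAUL: Lorem ipsum dolor sit amet.\n\n"
    "EVIL NINJA [on the roof]: In enim justo.\n\n"
    "PAUL [SCREAMING]: Aliquam lorem ante."
)

turns = parse_transcript(sample)
for speaker, context, speech in turns:
    print(speaker, context, speech, sep=' | ')
```

One caveat: the pattern is not anchored to the start of a line, so an uppercase word followed by a colon inside a speech would be treated as a new speaker. If that can happen in your transcripts, prefixing the pattern with ^ and passing re.MULTILINE is one possible fix.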

Answer 1: (score: 0)

[A-Z ]+(:|\[[a-zA-Z ]+\]:)

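I think what you got wrong is that you were not matching lowercase letters inside the []s, so [on the roof] did not match. I have added a-z to the character class, and now it matches. Also, you did not allow spaces in the speaker names, so I changed the beginning to [A-Z ]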
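A quick way to check this pattern is to run it against each speaker label. This is a small sketch; the `labels` list and the "text. PAUL: hi" string are made-up test data.

```python
import re

pattern = re.compile(r'[A-Z ]+(:|\[[a-zA-Z ]+\]:)')

labels = [
    "PAUL:",
    "LEONARD:",
    "EVIL NINJA [on the roof]:",
    "PAUL [SCREAMING]:",
]

# fullmatch requires the whole label to match, not just part of it.
matched = [pattern.fullmatch(label) is not None for label in labels]
print(matched)

# Because [A-Z ] also accepts spaces, a match can begin with a space.
m = pattern.search("text. PAUL: hi")
print(repr(m.group(0)))
```

All four labels match. Note that the space before a name can end up in the match, so calling strip() on the result is probably worthwhile.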


Answer 2: (score: 0)

Regex

"^([A-Z\s]+)(?:\[(?:[\w ]+)\])?:(.*?)$"
  • A-Z can be changed to \w
  • To capture the context, change (?:[\w ]+) to ([\w ]+)
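As a sketch of the second note, here is the regex with the context turned into a capturing group, so each match yields the speaker, the context and the speech. The `sample` text and the `rows` list are made up for illustration. Because \s also matches newlines and the space before the bracket, I am assuming you would want to strip() the captured groups.

```python
import re

# Variant of the regex above with the context as a capturing group.
regex = r"^([A-Z\s]+)(?:\[([\w ]+)\])?:(.*?)$"

sample = "PAUL: Hello.\n\nEVIL NINJA [on the roof]: In enim justo."

rows = []
for m in re.finditer(regex, sample, re.MULTILINE):
    speaker = m.group(1).strip()   # drop the newline or space picked up by \s
    context = m.group(2)           # None when there is no [context]
    speech = m.group(3).strip()
    rows.append((speaker, context, speech))

for row in rows:
    print(row)
```

Since . does not match a newline and $ stops at the end of the line, this variant only captures speeches that fit on a single line.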

Code

import re

regex = r"^([A-Z\s]+)(?:\[(?:[\w ]+)\])?:(.*?)$"

test_str = ("PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. \n\n"
        "LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. \n\n"
        "EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. \n\n"
        "PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. ")

matches = re.finditer(regex, test_str, re.MULTILINE)

# Number the matches from 1 for display.
for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at {start}-{end}: {match}".format(
        match_num=match_num, start=match.start(), end=match.end(), match=match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num} found at {start}-{end}: {group}".format(
            group_num=group_num, start=match.start(group_num), end=match.end(group_num), group=match.group(group_num)))

Output

Match 1 was found at 0-100: PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.     
Group 1 found at 0-4: PAUL
Group 2 found at 5-97:  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

Match 2 was found at 100-381: LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. 
Group 1 found at 100-107: LEONARD
Group 2 found at 108-378:  Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

Match 3 was found at 381-684: EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.     
Group 1 found at 381-392: EVIL NINJA 
Group 2 found at 406-681:  In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

Match 4 was found at 684-767: PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. 
Group 1 found at 684-689: PAUL 
Group 2 found at 701-767:  Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.