Regex using lookarounds to parse a Buffy script

Time: 2017-04-29 22:50:15

Tags: python regex regex-lookarounds

I'm having trouble parsing this page: http://www.buffyworld.com/buffy/transcripts/114_tran.html

I'm trying to extract each character's name together with the dialogue that goes with it. The text looks like this:

<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)

Ideally, I'd match from a <p> or <br> up to the next <p> or <br>. I tried to use a lookbehind and a lookahead for this:

reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)

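Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I do match lines, but only as long as the dialogue contains no newline. The match seems to stop at the newline instead of continuing to the next <p>.

For example, in this line, "Thanks." is not matched: <p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly) Thanks.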

Thanks for any insight!

1 answer:

Answer 0 (score: 0)

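Work around the dot, which does not match newlines by default, by using a negated character class instead: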

re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
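To see what that pattern does with the multi-line BUFFY speech from the question, here is a minimal sketch. It makes two assumptions of my own: the outer group is non-capturing, because the capturing version returns an extra empty string as the first item of each tuple, and whitespace is collapsed afterwards so the line break inside the speech disappears. The names sample, pattern, and matches are only for illustration.

```python
import re

# A fragment of the transcript from the question, including the speech
# that is broken across two lines.
sample = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# [^<]+ consumes everything up to the next tag, newlines included,
# so no lookahead and no flag are needed.
pattern = r"(?:(?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)"

# Collapse runs of whitespace (including the embedded newline) to one space.
matches = [
    (name, " ".join(speech.split()))
    for name, speech in re.findall(pattern, sample)
]

for name, speech in matches:
    print(name, "->", speech)
```

The stage direction starting with "Dawn piles" is skipped because [A-Z]+ only accepts an all-caps name directly followed by a colon.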

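Alternatively, you can try the special flag re.DOTALL, which makes the dot match newlines as well. Personally, I avoid regular expressions whenever split or an HTML parser will do: the escaping, the parameters, the limitations, and the flags can drive anyone mad. There is also re.split.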
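If you would rather keep the shape of the pattern from the question, the flag approach could look roughly like the sketch below. The changes are my own assumptions, not something taken from the question: the character group is narrowed to [A-Z]+, the dialogue group is made lazy so it stops at the first following tag, and \Z is added so the last speech in the text is also caught.

```python
import re

# Two speeches from the question; ANYA's speech spans two lines.
sample = (
    "<p>ANYA: See ... this seating chart makes no sense. We have to do it again.\n"
    "(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating\n"
    "chart's fine.\n"
)

# re.DOTALL lets "." cross newlines; ".+?" is lazy, so each speech ends
# at the first <p> or <br> that follows it, or at the end of the text.
pattern = re.compile(
    r"(?:(?<=<p>)|(?<=<br>))(?P<character>[A-Z]+):(?P<dialogue>.+?)(?=<p>|<br>|\Z)",
    re.DOTALL,
)

script = [
    (m.group("character"), " ".join(m.group("dialogue").split()))
    for m in pattern.finditer(sample)
]

for character, dialogue in script:
    print(character, "->", dialogue)
```

Without the lazy quantifier, the dot would happily run across every tag up to the last one in the document once re.DOTALL is on, which is why the two changes go together.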

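dialogs = {}
# Treat <br> like <p> so that every speech starts its own chunk.
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')

for p in paragraphs:
    # Only chunks containing a colon can be "NAME: speech".
    if ":" in p:
        char, line = p.split(":", 1)
        # setdefault creates the list on first sight, so the first line is kept too.
        dialogs.setdefault(char, []).append(line)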
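The same idea can be written with re.split, which removes the need for the replace step. This is only a sketch: the function name collect_dialogue is made up for the example, and the isupper() check is a heuristic I added so that stage directions that happen to contain a colon are not mistaken for speeches.

```python
import re
from collections import defaultdict


def collect_dialogue(html_text):
    """Group speeches by character name (hypothetical helper)."""
    dialogs = defaultdict(list)
    # Split on either tag in one pass instead of replacing <br> first.
    for chunk in re.split(r"<p>|<br>", html_text):
        name, sep, speech = chunk.partition(":")
        name = name.strip()
        # Assumed heuristic: speaker names are written in all caps.
        if sep and name.isupper():
            dialogs[name].append(" ".join(speech.split()))
    return dict(dialogs)


sample = (
    "<p>BUFFY: Wait!\n"
    "<p>She stands there panting, watching the truck turn a corner.\n"
    "<p>BUFFY: (whining) Don't you want your garbage?\n"
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
)

result = collect_dialogue(sample)
for name, lines in result.items():
    print(name, lines)
```

For anything beyond a quick script, a real HTML parser is probably still the safer choice, since the transcript markup may not be as regular as this excerpt.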