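I'm having trouble parsing this page: http://www.buffyworld.com/buffy/transcripts/114_tran.html
I'm trying to extract each character name together with the associated dialogue. The text looks like this: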
<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)
Ideally, I would match from a <p> or <br> up to the next <p> or <br>. I tried to use lookahead and lookbehind for this:
reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)
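Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), lines do match, as long as the dialogue contains no newline. The match seems to stop at the newline instead of continuing to the next <p>. In this line, for example, "Thanks." is not matched:
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)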
Thanks.
Thanks for any insight!
Answer 0 (score: 0)
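To work around the dot, which does not match newlines by default, match everything up to the next tag instead: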
re.findall(r'(?:(?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
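To make the effect concrete, here is a minimal sketch of that pattern run on a few lines from the question. The `sample` string, the `script` name, and the whitespace clean-up are my own additions for illustration; the non-capturing `(?:...)` around the lookbehinds just keeps `findall` from returning an extra empty group.

```python
import re

# A short excerpt of the transcript from the question, including the
# speech that wraps onto a second line ("Thanks.").
sample = (
    "<p>BUFFY: (whining) Don't you want your garbage?\n"
    "<p>She sighs, pouts, turns and walks back toward the house.\n"
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# [^<]+ matches any character except '<', newlines included, so the
# dialogue runs up to the next tag without needing a lookahead.
pattern = re.compile(r"(?:(?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)")

# Collapse embedded newlines and surrounding spaces (illustrative clean-up).
script = [(name, " ".join(speech.split())) for name, speech in pattern.findall(sample)]

for name, speech in script:
    print(f"{name}: {speech}")
```

Stage directions such as "She sighs, ..." are skipped because `[A-Z]+` must be followed directly by a colon. This assumes speaker names are a single all-caps word, which holds for the excerpt shown but may not hold for the whole page.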
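Also, you can try the special flag (re.DOTALL) that makes the dot match newlines as well. Personally, whenever I can, I use split or an HTML parser instead. Regex escaping, plus all the parameters, limitations and flags, can drive anyone crazy. There is also re.split.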
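dialogs = {}
# Treat <br> like <p> so that every speech starts its own chunk.
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')
for p in paragraphs:
    if ":" in p:
        # Split on the first colon only; the speech itself may contain colons.
        char, line = p.split(":", 1)
        # setdefault creates the list on first sight, so a character's
        # first line is stored too instead of being dropped.
        dialogs.setdefault(char.strip(), []).append(line.strip())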
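If you would rather stay close to the regex from the question, the flag route might look like the sketch below. It is only a sketch under a few assumptions: `sample` is a hand-copied excerpt, the speaker name is restricted to `[A-Z]+` (with `.+` and re.DOTALL the name group could swallow whole paragraphs), the dialogue is made lazy with `.+?`, and `\Z` is added to the lookahead so the last speech still matches when no tag follows it.

```python
import re

# Excerpt from the question: Anya's speech wraps across lines and is
# followed by <br>; Xander's speech runs to the end of the text.
sample = (
    "<p>ANYA: See ... this seating chart makes no sense. We have to do it again.\n"
    "(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating\n"
    "chart's fine. Let's get back to the table arrangements.\n"
)

# re.DOTALL lets '.' match newlines. The lazy '.+?' stops at the first
# position where the lookahead sees <p>, <br>, or the end of the string.
pattern = re.compile(
    r"(?:(?<=<p>)|(?<=<br>))(?P<character>[A-Z]+):(?P<dialogue>.+?)(?=<p>|<br>|\Z)",
    re.DOTALL,
)

# Normalize whitespace so wrapped lines read as one sentence.
script = [
    (m.group("character"), " ".join(m.group("dialogue").split()))
    for m in pattern.finditer(sample)
]

for character, dialogue in script:
    print(f"{character}: {dialogue}")
```

The lazy quantifier matters here: with a greedy `.+` under re.DOTALL, the first dialogue would run to the last tag in the document instead of the next one.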