I have a text file that contains multiple lines, punctuation, and other word boundaries.
Example:
TITLE: Praying SINGER: Kesha
[Music video spoken intro:]
"Am I dead? Or is this one of those dreams? Those horrible dreams that seem
like they last forever? If I am alive, why? Why? If there is a God or
whatever, something, somewhere, why have I been abandoned by everyone and
everything I've ever known?
I've ever loved? Stranded. What is the lesson?
What is the point?
TITLE: Don't Stop the Party SINGER: Pitbull
I say, y'all having a good time, I'll bet
Yeah, yeah, yeah
Que no pare la fiesta
Don't stop the party
Yeah, yeah, yeah
Que no pare la fiesta
Don't stop the party
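The goal is to pull out each song's lyrics without the title or singer. My file contains at least 10 songs, so I need a regular expression that pulls out all 10. This is what I have tried: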
content = re.findall(r'TITLE\:\s\w+(.*)?\s*SINGER', file, re.DOTALL)
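Answer 0 (score: 0)
Your approach isn't entirely wrong, but it needs a bit more to reach the desired goal. The following pattern captures the text between the headers in group 1: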
(?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)(.+?)(?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)
[Demo][1]
import re
regex = r"(?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)(.+?)(?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)"
test_str = (" TITLE: Praying SINGER: Kesha\n\n"
" [Music video spoken intro:]\n"
" \"Am I dead? Or is this one of those dreams? Those horrible dreams that seem \n"
" like they last forever? If I am alive, why? Why? If there is a God or \n"
" whatever, something, somewhere, why have I been abandoned by everyone and \n"
" everything I've ever known? \n"
" I've ever loved? Stranded. What is the lesson? \n"
" What is the point? \n\n"
" TITLE: Don't Stop the Party SINGER: Pitbull\n\n"
" I say, y'all having a good time, I'll bet\n\n"
" Yeah, yeah, yeah\n"
" Que no pare la fiesta\n"
" Don't stop the party\n"
" Yeah, yeah, yeah\n"
" Que no pare la fiesta\n"
" Don't stop the party\n\n"
" TITLE: Don't Stop the Code SINGER: Ugly Kid George\n\n"
" Ding Dong\n\n"
" Yeah, yeah, yeah\n"
" Que no pare la fiesta\n"
" Don't stop the party\n"
" Yeah, yeah, yeah\n"
" Que no pare la fiesta\n"
" Don't stop the\n\n"
" Ding Dong")
matches = re.finditer(regex, test_str, re.DOTALL | re.UNICODE)
# Report every capture group of every match (group 1 holds the lyrics).
for matchNum, match in enumerate(matches, start=1):
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
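One caveat with the pattern above: `SINGER:\s\w+?\s` consumes only the first word of the singer's name, so for a multi-word name such as "Ugly Kid George" the rest of the name ends up at the start of the captured group. A possible alternative, sketched here under the assumption that every header sits on a single line in the form `TITLE: <title> SINGER: <singer>`, is to locate the header lines first and slice out the text between them. The names `HEADER` and `extract_lyrics` are made up for this illustration.

```python
import re

# Assumed header layout: "TITLE: <title> SINGER: <singer>" on a single line.
# With re.MULTILINE, ^ and $ anchor at line boundaries, so the singer group
# runs to the end of the line and may contain several words.
HEADER = re.compile(
    r"^[ \t]*TITLE:[ \t]*(.+?)[ \t]+SINGER:[ \t]*(.+?)[ \t\r]*$",
    re.MULTILINE,
)

def extract_lyrics(text):
    """Return a list of (title, singer, lyrics) tuples, one per header found."""
    headers = list(HEADER.finditer(text))
    songs = []
    for i, header in enumerate(headers):
        # Lyrics run from the end of this header to the start of the next one,
        # or to the end of the text for the last song.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        lyrics = text[header.end():end].strip()
        songs.append((header.group(1), header.group(2), lyrics))
    return songs

sample = (
    "TITLE: Praying SINGER: Kesha\n"
    "What is the lesson?\n"
    "What is the point?\n"
    "TITLE: Don't Stop the Code SINGER: Ugly Kid George\n"
    "Ding Dong\n"
    "Yeah, yeah, yeah\n"
)

for title, singer, lyrics in extract_lyrics(sample):
    print("{} / {}:\n{}\n".format(title, singer, lyrics))
```

To run this against the real file, read it into a string first (for example with `open(path).read()`) and pass that string to `extract_lyrics`. If a lyric line could itself contain both `TITLE:` and `SINGER:`, it would be mistaken for a header, so the header pattern may need tightening for your data.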
[1]: https://regex101.com/r/g1BG5e/1