How to pull multiple lines between two words

Date: 2017-12-10 04:46:00

Tags: regex python-3.x

I have a text file that contains multiple lines, punctuation, and other word boundaries.

 Example:
 TITLE: Praying  SINGER: Kesha

 [Music video spoken intro:]
 "Am I dead? Or is this one of those dreams? Those horrible dreams that seem 
 like they last forever? If I am alive, why? Why? If there is a God or 
 whatever, something, somewhere, why have I been abandoned by everyone and 
 everything I've ever known? 
 I've ever loved? Stranded. What is the lesson? 
 What is the point? 

 TITLE: Don't Stop the Party  SINGER: Pitbull

  I say, y'all having a good time, I'll bet

    Yeah, yeah, yeah
    Que no pare la fiesta
    Don't stop the party
    Yeah, yeah, yeah
    Que no pare la fiesta
    Don't stop the party

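The goal is to pull out the lyrics of each song without the title or singer. My file contains at least 10 songs, so I need a regular expression that pulls out all 10.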

content = re.findall(r'TITLE\:\s\w+(.*)?\s*SINGER', file, re.DOTALL)

1 answer:

Answer 0 (score: 0)

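Your approach is not entirely wrong, but it needs a bit more to reach the desired goal. The following pattern captures the text between the title lines in group 1: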

(?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)(.+?)(?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)

[Demo][1]
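To make the pattern easier to read, here is a sketch of the same expression laid out with `re.VERBOSE`, so each part can carry a comment. The names `LYRICS`, `sample`, and `lyrics` are only illustrative, and the shortened sample text assumes single-word singer names.

```python
import re

# Same pattern as above, split across lines with re.VERBOSE.
# re.DOTALL lets "." match newlines, so group 1 can span several lines.
LYRICS = re.compile(
    r"""
    (?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)    # header line, matched but not captured
    (.+?)                                       # group 1, the lyrics (lazy)
    (?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)   # stop before the next header or at the end
    """,
    re.DOTALL | re.VERBOSE,
)

sample = (
    "TITLE: Praying  SINGER: Kesha\n\n"
    "What is the point?\n\n"
    "TITLE: Don't Stop the Party  SINGER: Pitbull\n\n"
    "Yeah, yeah, yeah\n"
    "Don't stop the party\n"
)

# Collect group 1 of every match, trimming surrounding whitespace.
lyrics = [m.group(1).strip() for m in LYRICS.finditer(sample)]
print(lyrics)
```

The lookahead is what keeps the next header out of the match: it checks that a header (or the end of the text) follows, without consuming it, so the next search can start on that header.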

import re

regex = r"(?:TITLE:\s.+?\s*SINGER:\s\w+?\s\r?\n?)(.+?)(?=\s+?TITLE:\s.+?\s*SINGER:\s\w+?\s+?|$)"

test_str = (" TITLE: Praying  SINGER: Kesha\n\n"
    " [Music video spoken intro:]\n"
    " \"Am I dead? Or is this one of those dreams? Those horrible dreams that seem \n"
    " like they last forever? If I am alive, why? Why? If there is a God or \n"
    " whatever, something, somewhere, why have I been abandoned by everyone and \n"
    " everything I've ever known? \n"
    " I've ever loved? Stranded. What is the lesson? \n"
    " What is the point? \n\n"
    " TITLE: Don't Stop the Party  SINGER: Pitbull\n\n"
    "  I say, y'all having a good time, I'll bet\n\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the party\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the party\n\n"
    " TITLE: Don't Stop the Code  SINGER: Ugly Kid George\n\n"
    "  Ding Dong\n\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the party\n"
    "    Yeah, yeah, yeah\n"
    "    Que no pare la fiesta\n"
    "    Don't stop the\n\n"
    "   Ding Dong")

# re.DOTALL lets "." match line breaks, so group 1 can span a whole song.
matches = re.finditer(regex, test_str, re.DOTALL | re.UNICODE)

for match in matches:
    # Capture groups are numbered from 1; group 0 is the entire match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num} found at {start}-{end}: {group}".format(
            group_num=group_num,
            start=match.start(group_num),
            end=match.end(group_num),
            group=match.group(group_num),
        ))
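One caveat with the pattern above: `\w+?` consumes only the first word of the singer, so for a multi-word name such as "Ugly Kid George" the rest of the name ends up at the start of group 1. A minimal alternative sketch follows, assuming every header sits on a single line that contains both `TITLE:` and `SINGER:`. The names `HEADER` and `extract_lyrics` are hypothetical and not part of the original answer.

```python
import re

# Assumption: a header is one whole line containing "TITLE:" and later "SINGER:".
# re.MULTILINE makes ^ and $ match at the start and end of each line.
HEADER = re.compile(r"^[ \t]*TITLE:[^\n]*SINGER:[^\n]*$", re.MULTILINE)


def extract_lyrics(text):
    """Return the text between header lines, one string per song."""
    parts = HEADER.split(text)
    # parts[0] is whatever precedes the first header, so it is skipped.
    # strip() also drops the indentation of each song's first line.
    return [part.strip() for part in parts[1:]]


sample = (
    " TITLE: Don't Stop the Party  SINGER: Pitbull\n\n"
    "    Yeah, yeah, yeah\n"
    "    Don't stop the party\n\n"
    " TITLE: Don't Stop the Code  SINGER: Ugly Kid George\n\n"
    "  Ding Dong\n"
)

songs = extract_lyrics(sample)
for number, song in enumerate(songs, start=1):
    print("Song {}:\n{}\n".format(number, song))
```

Splitting on the header line avoids describing the singer's name at all, which is why names with spaces or punctuation do not leak into the lyrics. For a real file, the text would first be read into a string, for example with `open(...).read()`.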


  [1]: https://regex101.com/r/g1BG5e/1