How do I get all the text between two numbers using a regular expression in Python?

Time: 2016-01-27 18:42:25

Tags: python regex

I have text in this format:

Text

  

All Eyez on Me Track Listing       # Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4:39 2 All About U 2Pac 4:37 Fatal Yani Hadati Dru Down Snoop Dogg Nair Dogg Nate Dogg 3 Skandalouz 2Pac 4:09 Nate Dogg 4 Got My Mind Made Up 2Pac 5:13 Kurupt Redman Method Man Dat N Daz 5 How Do You Want It Jojo the Elf 4:47 2Pac 6 2 of Amerikaz Most Wanted 2Pac 4:07 Snoop Dogg 7 No More Pain 2Pac 6:14 8 Heartz of Men 2Pac 4:43 9 Life Goes On 2Pac 5:02 10 Only God Can Judge Me Rappin' 4-Tay 4:57 2Pac 11 Tradin War Stories Nair Dogg 5:29 The Storm CPO C-BO Outlawz 2Pac 12 California Love [Remix] Dr. Dre 6:25 2Pac Roger 13 I Ain't Mad at Cha 2Pac 4:53 Danny Boy 14 What'z Ya Phone No. Danny Boy 5:10 2Pac 15 (2) Can't C Me George Clinton 5:30 2Pac 16 (2) Shorty Wanna Be a Thug 2Pac 3:51 17 (2) Holla at Me 2Pac 4:56 18 (2) Wonda Why They Call U B____ 2Pac 4:19 19 (2) When We Ride Nair Dogg 5:09 2Pac 20 (2) Thug Passion Outlawz 5:08 The Storm Dramarydal Jewell 2Pac 21 (2) Picture Me Rollin' Danny Boy 5:15 2Pac CPO Big Syke 22 (2) Check Out Time Big Syke 4:39 Kurupt 2Pac 23 (2) Ratha Be Ya N____ 2Pac 4:14 Richie Rich 24 (2) All Eyez on Me Big Syke 5:08 2Pac 25 (2) Run tha Streetz The Storm 5:17 Nair Dogg Michel'le 2Pac 26 (2) Ain't Hard 2 Find B-Legit 4:29 E-40 C-BO 2Pac Richie Rich 27 (2) Heaven Ain't Hard 2 Find 2Pac 3:58

From this I need to get the titles of all the songs.

So far I have:

def extraction():

    f = open('Songs in Albums List.txt', 'r')
    str = 'Text All Eyez on Me Track Listing # Title Artisttime        1 Ambitionz Az a Ridah  2Pac 4:39' \
          '       2 All About U  2Pac 4:37              Fatal                 Yani Hadati                 ' \
          'Dru Down                 Snoop Dogg                 Nair Dogg                 Nate Dogg          ' \
          '3 Skandalouz  2Pac 4:09              Nate Dogg          4 Got My Mind Made Up  2Pac 5:13              ' \
          'Kurupt                 Redman                 Method Man                 Dat Nigga Daz          ' \
          '5 How Do You Want It  Jojo the Elf 4:47              2Pac          6 2 of Amerikaz Most Wanted  ' \
          '2Pac 4:07              Snoop Dogg          7 No More Pain  2Pac 6:14       8 Heartz of Men  2Pac 4:43       ' \
          '9 Life Goes On  2Pac 5:02       10 Only God Can Judge Me  Rappin 4-Tay 4:57              2Pac          ' \
          '11 Tradin War Stories  Nair Dogg 5:29              The Storm                 CPO                 C-BO' \
          '                 Outlawz                 2Pac          12 California Love [Remix]  Dr. Dre 6:25              ' \
          '2Pac                 Roger          13 I Aint Mad at Cha  2Pac 4:53              Danny Boy          ' \
          '14 Whatz Ya Phone No.  Danny Boy 5:10              2Pac          15 (2) Cant C Me  George Clinton 5:30' \
          '              2Pac          16 (2) Shorty Wanna Be a Thug  2Pac 3:51       17 (2) Holla at Me  2Pac 4:56' \
          '       18 (2) Wonda Why They Call U B____  2Pac 4:19       19 (2) When We Ride  Nair Dogg 5:09' \
          '              2Pac          20 (2) Thug Passion  Outlawz 5:08              The Storm                 ' \
          'Dramarydal                 Jewell                 2Pac          21 (2) Picture Me Rollin  Danny Boy 5:15' \
          '              2Pac                 CPO                 Big Syke          22 (2) Check Out Time  ' \
          'Big Syke 4:39              Kurupt                 2Pac          23 (2) Ratha Be Ya N____  2Pac 4:14' \
          '              Richie Rich          24 (2) All Eyez on Me  Big Syke 5:08              2Pac          ' \
          '25 (2) Run tha Streetz  The Storm 5:17              Nair Dogg                 Michelle                 ' \
          '2Pac          26 (2) Aint Hard 2 Find  B-Legit 4:29              E-40                 C-BO                 ' \
          '2Pac                 Richie Rich          27 (2) Heaven Aint Hard 2 Find  2Pac 3:58'


    st = " ".join(str.split())
    songs = re.findall(r'\d{0,3}(.+?):', st, re.I|re.M)
    # songs = songs.replace("\xc2\xa0", " ")
    s = " ".join(songs)
    s = s.replace("\xc2\xa0", " ")
    print s
    # s = re.sub("^\d+\s|\s\d+\s|\s\d+$", " ", s)
    print s
    t = re.findall(r'\s*[a-zA-Z0-9]\s*', s, re.I|re.M)
    x = []
    ind = []
    y = []
    z = 0
    for item in t:
        if len(item) > 2:
            y.append(z)
            x.append(t[t.index(item)])
            ind.append(t.index(item))
        z = z + 1
    print y
    new_x = []
    for string in x:
        new_x.append(string.split(' '));
    l = []

    for item in new_x:
        for val in item:
            l.append(filter(lambda space: space.strip(), val))
    # print l
    l = filter(lambda space: space.strip(), l)

    x = 0
    for vals in y:
        print vals
        t.pop(vals)
        t.insert(vals, l[y.index(vals)])
    print t[20], t[33], t[38], t[48]
    for vals in reversed(y):
        t.insert(vals+1, ' ')
    t = ''.join(t)
    t = re.findall(r'\d{0,3}\s*(.+)\s*\d', t, re.I|re.M)

    print t

This returns the following string:

['Text All Eyez on Me Track Listing  Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4 2 All About U 2Pac 4 Fatal Yani Hadati Dru Down Snoop Dogg Nair Dogg Nate Dogg 3 Skandalouz 2Pac 4 Nate Dogg 4 Got My Mind Made Up 2Pac 5 Kurupt Redman Method Man Dat N Daz 5 How Do You Want It Jojo the Elf 4 2Pac 6 2 of Amerikaz Most Wanted 2Pac 4 Snoop Dogg 7 No More Pain 2Pac 6 8 Heartz of Men 2Pac 4 9 Life Goes On 2Pac 5 10 Only God Can Judge Me Rappin 4Tay 4 2Pac 11 Tradin War Stories Nair Dogg 5 The Storm CPO CBO Outlawz 2Pac 12 California Love Remix Dr Dre 6 2Pac Roger 13 I Aint Mad at Cha 2Pac 4 Danny Boy 14 Whatz Ya Phone No Danny Boy 5 2Pac 15 2 Cant C Me George Clinton 5 2Pac 16 2 Shorty Wanna Be a Thug 2Pac 3 17 2 Holla at Me 2Pac 4 18 2 Wonda Why They Call U B 2Pac 4 19 2 When We Ride Nair Dogg 5 2Pac 20 2 Thug Passion Outlawz 5 The Storm Dramarydal Jewell 2Pac 21 2 Picture Me Rollin Danny Boy 5 2Pac CPO Big Syke 22 2 Check Out Time Big Syke 4 Kurupt 2Pac 23 2 Ratha Be Ya N 2Pac 4 Richie Rich 24 2 All Eyez on Me Big Syke 5 2Pac 25 2 Run tha Streetz The Storm 5 Nair Dogg Michelle 2Pac 26 2 Aint Hard 2 Find BLegit 4 E40 CBO 2Pac Richie Rich 27 2 Heaven Aint Hard 2 Find 2Pac ']

I want to get the text between the numbers and then filter it to find the songs. Also, is there a better way to get the song titles into a list?

2 answers:

Answer 0: (score: 0)

Try this regex: r = re.split(r"\s+\d+\s+", str)
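A minimal sketch of what this split does, run on a shortened sample rather than the full listing (the names `sample` and `parts` are illustrative and not from the question). The pattern splits on every standalone number, so it also cuts through titles that contain one, such as "Ain't Hard 2 Find".

```python
import re

# Shortened sample in the same flat format as the question's text.
sample = ("Track Listing 1 Ambitionz Az a Ridah 2Pac 4:39 "
          "2 All About U 2Pac 4:37 "
          "26 (2) Ain't Hard 2 Find B-Legit 4:29")

# Split on every run of digits that has whitespace on both sides.
# "2Pac", "4:39" and "(2)" are not matched, but the "2" in the title is.
parts = re.split(r"\s+\d+\s+", sample)

for part in parts:
    print(part)
```

The pieces still contain the artist and the duration, so some further filtering would be needed to isolate the titles.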

Answer 1: (score: 0)

Why put so much strain on re?

import re

blah = """All Eyez on Me Track Listing # Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4:39 2 All About U 2Pac 4:37       Fatal           Yani Hadati           Dru Down           Snoop Dogg           Nair Dogg           Nate Dogg     3 Skandalouz 2Pac 4:09       Nate Dogg     4 Got My Mind Made Up 2Pac 5:13       Kurupt           Redman           Method Man           Dat N Daz     5 How Do You Want It Jojo the Elf 4:47       2Pac     6 2 of Amerikaz Most Wanted 2Pac 4:07       Snoop Dogg     7 No More Pain 2Pac 6:14 8 Heartz of Men 2Pac 4:43 9 Life Goes On 2Pac 5:02 10 Only God Can Judge Me Rappin' 4-Tay 4:57       2Pac     11 Tradin War Stories Nair Dogg 5:29       The Storm           CPO           C-BO           Outlawz           2Pac     12 California Love [Remix] Dr. Dre 6:25       2Pac           Roger     13 I Ain't Mad at Cha 2Pac 4:53       Danny Boy     14 What'z Ya Phone No. Danny Boy 5:10       2Pac     15 (2) Can't C Me George Clinton 5:30       2Pac     16 (2) Shorty Wanna Be a Thug 2Pac 3:51 17 (2) Holla at Me 2Pac 4:56 18 (2) Wonda Why They Call U B____ 2Pac 4:19 19 (2) When We Ride Nair Dogg 5:09       2Pac     20 (2) Thug Passion Outlawz 5:08       The Storm           Dramarydal           Jewell           2Pac     21 (2) Picture Me Rollin' Danny Boy 5:15       2Pac           CPO           Big Syke     22 (2) Check Out Time Big Syke 4:39       Kurupt           2Pac     23 (2) Ratha Be Ya N____ 2Pac 4:14       Richie Rich     24 (2) All Eyez on Me Big Syke 5:08       2Pac     25 (2) Run tha Streetz The Storm 5:17       Nair Dogg           Michel'le           2Pac     26 (2) Ain't Hard 2 Find B-Legit 4:29       E-40           C-BO           2Pac           Richie Rich     27 (2) Heaven Ain't Hard 2 Find 2Pac 3:58 Extra"""

def extraction2(s):
    s = re.sub(r'\s+', " ", s)
    tracks = []
    trackno = 1
    while 1:
        track = { "trackno" : trackno, "title" : "", "duration": None }
        # start of next track
        from_ = s.find(str(trackno))
        if from_ < 0:
            # last title has additional artists (not the case in the example)
            tracks[trackno-2]["title"] += " " + s.strip()
        else:
            if trackno > 1 and from_ > 0:
                # add "trailing" artists to previous track
                trailing = s[:from_].strip()
                if trailing:
                    # separate the trailing artists from the existing text
                    tracks[trackno-2]["title"] += " " + trailing
            # time indicates end of track
            m = re.search(r'\d{1,2}:\d{2}', s[from_:])
            if m:
                line = s[from_:from_+m.end()].split(" ")
                track["title"] = " ".join(line[1:-1]).strip()
                track["duration"] = line[-1:][0]
                tracks.append(track)
        if not track["duration"]:
            break
        s = s[from_+m.end():]
        trackno += 1
    return tracks


tracklist = extraction2(blah)
import json
# print(...) with a single argument works on both Python 2 and Python 3
print(json.dumps(tracklist, indent=4))
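For comparison, here is a sketch of a single-pattern variant that returns only the titles. It rests on an assumption taken from the string literal in the question, not from the flattened text: the title and the main artist are separated by at least two spaces. The names `TRACK`, `extract_titles` and `sample` are hypothetical, and the sample is shortened.

```python
import re

# Shortened sample. As in the question's string literal, the title and the
# main artist are ASSUMED to be separated by two or more spaces.
sample = ("# Title Artisttime        1 Ambitionz Az a Ridah  2Pac 4:39"
          "       2 All About U  2Pac 4:37              Fatal"
          "                 Nate Dogg          "
          "6 2 of Amerikaz Most Wanted  2Pac 4:07              Snoop Dogg"
          "          15 (2) Cant C Me  George Clinton 5:30              2Pac")

TRACK = re.compile(
    r"(?<!\S)(\d+)\s+"      # track number, not glued to other characters
    r"(?:\(\d+\)\s+)?"      # optional disc marker such as "(2)"
    r"(.+?)\s{2,}"          # title, up to the first run of 2+ spaces
    r"(.+?)\s+"             # main artist
    r"(\d{1,2}:\d{2})"      # duration, e.g. 4:39
)

def extract_titles(text):
    """Return the song titles found in a flat track listing."""
    return [m.group(2) for m in TRACK.finditer(text)]

titles = extract_titles(sample)
print(titles)
```

The `(?<!\S)` lookbehind keeps digits inside names such as "E-40" from being read as track numbers. If the whitespace has already been collapsed to single spaces, this pattern no longer applies, because nothing in the text then marks where the title ends and the artist begins.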

I'm with @WiktorStribiżew on this: I like puzzles too ;)

A note on the original code: using the name of the built-in type str as a variable name is poor style.
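A small illustration of why this matters, using two hypothetical functions (`label_track` and `label_track_fixed` are not part of the original code). Once `str` is rebound to a string, calls such as `str(trackno)` stop working in that scope:

```python
def label_track(trackno):
    # The local name `str` shadows the built-in type inside this function.
    str = "Ambitionz Az a Ridah"
    # Raises TypeError: a string object is not callable.
    return str(trackno) + " " + str

def label_track_fixed(trackno):
    # A descriptive name leaves the built-in `str` usable.
    title = "Ambitionz Az a Ridah"
    return str(trackno) + " " + title

try:
    label_track(1)
except TypeError as exc:
    print("shadowed:", exc)

print(label_track_fixed(1))
```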