I have text in this format:
Text
All Eyez on Me Track Listing # Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4:39 2 All About U 2Pac 4:37 Fatal Yani Hadati Dru Down Snoop Dogg Nair Dogg Nate Dogg 3 Skandalouz 2Pac 4:09 Nate Dogg 4 Got My Mind Made Up 2Pac 5:13 Kurupt Redman Method Man Dat N Daz 5 How Do You Want It Jojo the Elf 4:47 2Pac 6 2 of Amerikaz Most Wanted 2Pac 4:07 Snoop Dogg 7 No More Pain 2Pac 6:14 8 Heartz of Men 2Pac 4:43 9 Life Goes On 2Pac 5:02 10 Only God Can Judge Me Rappin' 4-Tay 4:57 2Pac 11 Tradin War Stories Nair Dogg 5:29 The Storm CPO C-BO Outlawz 2Pac 12 California Love [Remix] Dr. Dre 6:25 2Pac Roger 13 I Ain't Mad at Cha 2Pac 4:53 Danny Boy 14 What'z Ya Phone No. Danny Boy 5:10 2Pac 15 (2) Can't C Me George Clinton 5:30 2Pac 16 (2) Shorty Wanna Be a Thug 2Pac 3:51 17 (2) Holla at Me 2Pac 4:56 18 (2) Wonda Why They Call U B____ 2Pac 4:19 19 (2) When We Ride Nair Dogg 5:09 2Pac 20 (2) Thug Passion Outlawz 5:08 The Storm Dramarydal Jewell 2Pac 21 (2) Picture Me Rollin' Danny Boy 5:15 2Pac CPO Big Syke 22 (2) Check Out Time Big Syke 4:39 Kurupt 2Pac 23 (2) Ratha Be Ya N____ 2Pac 4:14 Richie Rich 24 (2) All Eyez on Me Big Syke 5:08 2Pac 25 (2) Run tha Streetz The Storm 5:17 Nair Dogg Michel'le 2Pac 26 (2) Ain't Hard 2 Find B-Legit 4:29 E-40 C-BO 2Pac Richie Rich 27 (2) Heaven Ain't Hard 2 Find 2Pac 3:58
From this I need to get the titles of all the songs.
So far, I have:
# Python 2 code (print statements, filter() returning str/list)
import re

def extraction():
    f = open('Songs in Albums List.txt', 'r')
    str = 'Text All Eyez on Me Track Listing # Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4:39' \
          ' 2 All About U 2Pac 4:37 Fatal Yani Hadati ' \
          'Dru Down Snoop Dogg Nair Dogg Nate Dogg ' \
          '3 Skandalouz 2Pac 4:09 Nate Dogg 4 Got My Mind Made Up 2Pac 5:13 ' \
          'Kurupt Redman Method Man Dat N Daz ' \
          '5 How Do You Want It Jojo the Elf 4:47 2Pac 6 2 of Amerikaz Most Wanted ' \
          '2Pac 4:07 Snoop Dogg 7 No More Pain 2Pac 6:14 8 Heartz of Men 2Pac 4:43 ' \
          '9 Life Goes On 2Pac 5:02 10 Only God Can Judge Me Rappin 4-Tay 4:57 2Pac ' \
          '11 Tradin War Stories Nair Dogg 5:29 The Storm CPO C-BO' \
          ' Outlawz 2Pac 12 California Love [Remix] Dr. Dre 6:25 ' \
          '2Pac Roger 13 I Aint Mad at Cha 2Pac 4:53 Danny Boy ' \
          '14 Whatz Ya Phone No. Danny Boy 5:10 2Pac 15 (2) Cant C Me George Clinton 5:30' \
          ' 2Pac 16 (2) Shorty Wanna Be a Thug 2Pac 3:51 17 (2) Holla at Me 2Pac 4:56' \
          ' 18 (2) Wonda Why They Call U B____ 2Pac 4:19 19 (2) When We Ride Nair Dogg 5:09' \
          ' 2Pac 20 (2) Thug Passion Outlawz 5:08 The Storm ' \
          'Dramarydal Jewell 2Pac 21 (2) Picture Me Rollin Danny Boy 5:15' \
          ' 2Pac CPO Big Syke 22 (2) Check Out Time ' \
          'Big Syke 4:39 Kurupt 2Pac 23 (2) Ratha Be Ya N____ 2Pac 4:14' \
          ' Richie Rich 24 (2) All Eyez on Me Big Syke 5:08 2Pac ' \
          '25 (2) Run tha Streetz The Storm 5:17 Nair Dogg Michelle ' \
          '2Pac 26 (2) Aint Hard 2 Find B-Legit 4:29 E-40 C-BO ' \
          '2Pac Richie Rich 27 (2) Heaven Aint Hard 2 Find 2Pac 3:58'
    # collapse runs of whitespace into single spaces
    st = " ".join(str.split())
    # lazily capture the text leading up to each ':'
    songs = re.findall(r'\d{0,3}(.+?):', st, re.I|re.M)
    # songs = songs.replace("\xc2\xa0", " ")
    s = " ".join(songs)
    s = s.replace("\xc2\xa0", " ")
    print s
    # s = re.sub("^\d+\s|\s\d+\s|\s\d+$", " ", s)
    print s
    # one alphanumeric character per match, with any surrounding whitespace
    t = re.findall(r'\s*[a-zA-Z0-9]\s*', s, re.I|re.M)
    x = []
    ind = []
    y = []
    z = 0
    for item in t:
        if len(item) > 2:
            y.append(z)
            x.append(t[t.index(item)])
            ind.append(t.index(item))
        z = z + 1
    print y
    new_x = []
    for string in x:
        new_x.append(string.split(' '))
    l = []
    for item in new_x:
        for val in item:
            l.append(filter(lambda space: space.strip(), val))
    # print l
    l = filter(lambda space: space.strip(), l)
    x = 0
    for vals in y:
        print vals
        t.pop(vals)
        t.insert(vals, l[y.index(vals)])
    print t[20], t[33], t[38], t[48]
    for vals in reversed(y):
        t.insert(vals+1, ' ')
    t = ''.join(t)
    t = re.findall(r'\d{0,3}\s*(.+)\s*\d', t, re.I|re.M)
    print t
which returns a string like this:
['Text All Eyez on Me Track Listing Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4 2 All About U 2Pac 4 Fatal Yani Hadati Dru Down Snoop Dogg Nair Dogg Nate Dogg 3 Skandalouz 2Pac 4 Nate Dogg 4 Got My Mind Made Up 2Pac 5 Kurupt Redman Method Man Dat N Daz 5 How Do You Want It Jojo the Elf 4 2Pac 6 2 of Amerikaz Most Wanted 2Pac 4 Snoop Dogg 7 No More Pain 2Pac 6 8 Heartz of Men 2Pac 4 9 Life Goes On 2Pac 5 10 Only God Can Judge Me Rappin 4Tay 4 2Pac 11 Tradin War Stories Nair Dogg 5 The Storm CPO CBO Outlawz 2Pac 12 California Love Remix Dr Dre 6 2Pac Roger 13 I Aint Mad at Cha 2Pac 4 Danny Boy 14 Whatz Ya Phone No Danny Boy 5 2Pac 15 2 Cant C Me George Clinton 5 2Pac 16 2 Shorty Wanna Be a Thug 2Pac 3 17 2 Holla at Me 2Pac 4 18 2 Wonda Why They Call U B 2Pac 4 19 2 When We Ride Nair Dogg 5 2Pac 20 2 Thug Passion Outlawz 5 The Storm Dramarydal Jewell 2Pac 21 2 Picture Me Rollin Danny Boy 5 2Pac CPO Big Syke 22 2 Check Out Time Big Syke 4 Kurupt 2Pac 23 2 Ratha Be Ya N 2Pac 4 Richie Rich 24 2 All Eyez on Me Big Syke 5 2Pac 25 2 Run tha Streetz The Storm 5 Nair Dogg Michelle 2Pac 26 2 Aint Hard 2 Find BLegit 4 E40 CBO 2Pac Richie Rich 27 2 Heaven Aint Hard 2 Find 2Pac ']
I want to get the text between the numbers and filter it to find the songs. Also, is there a better way to get the song titles into a list?
Answer 0: (score: 0)
Try this regular expression: r = re.split(r"\s+\d+\s+", str)
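For illustration, here is what that split produces on a short excerpt of the text. This is only a sketch; the names `sample` and `tricky` are made up for the example. The pattern cuts at every standalone number, so it also cuts inside titles that contain one, and each piece still carries the artist and duration.

```python
import re

# Short excerpt of the flattened track listing (tracks 1-3 only).
sample = ("1 Ambitionz Az a Ridah 2Pac 4:39 2 All About U 2Pac 4:37 "
          "Fatal 3 Skandalouz 2Pac 4:09")

# Split wherever a number stands alone between whitespace.
parts = re.split(r"\s+\d+\s+", sample)
print(parts)

# A title that contains a standalone number is split as well.
tricky = "Ain't Hard 2 Find B-Legit 4:29"
print(re.split(r"\s+\d+\s+", tricky))
```

The leading "1" stays attached to the first piece because no whitespace precedes it, and "2Pac" and "4:39" are left alone because the digits are not followed by whitespace.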
Answer 1: (score: 0)
Why put so much strain on re?
import re
blah = """All Eyez on Me Track Listing # Title Artisttime 1 Ambitionz Az a Ridah 2Pac 4:39 2 All About U 2Pac 4:37 Fatal Yani Hadati Dru Down Snoop Dogg Nair Dogg Nate Dogg 3 Skandalouz 2Pac 4:09 Nate Dogg 4 Got My Mind Made Up 2Pac 5:13 Kurupt Redman Method Man Dat N Daz 5 How Do You Want It Jojo the Elf 4:47 2Pac 6 2 of Amerikaz Most Wanted 2Pac 4:07 Snoop Dogg 7 No More Pain 2Pac 6:14 8 Heartz of Men 2Pac 4:43 9 Life Goes On 2Pac 5:02 10 Only God Can Judge Me Rappin' 4-Tay 4:57 2Pac 11 Tradin War Stories Nair Dogg 5:29 The Storm CPO C-BO Outlawz 2Pac 12 California Love [Remix] Dr. Dre 6:25 2Pac Roger 13 I Ain't Mad at Cha 2Pac 4:53 Danny Boy 14 What'z Ya Phone No. Danny Boy 5:10 2Pac 15 (2) Can't C Me George Clinton 5:30 2Pac 16 (2) Shorty Wanna Be a Thug 2Pac 3:51 17 (2) Holla at Me 2Pac 4:56 18 (2) Wonda Why They Call U B____ 2Pac 4:19 19 (2) When We Ride Nair Dogg 5:09 2Pac 20 (2) Thug Passion Outlawz 5:08 The Storm Dramarydal Jewell 2Pac 21 (2) Picture Me Rollin' Danny Boy 5:15 2Pac CPO Big Syke 22 (2) Check Out Time Big Syke 4:39 Kurupt 2Pac 23 (2) Ratha Be Ya N____ 2Pac 4:14 Richie Rich 24 (2) All Eyez on Me Big Syke 5:08 2Pac 25 (2) Run tha Streetz The Storm 5:17 Nair Dogg Michel'le 2Pac 26 (2) Ain't Hard 2 Find B-Legit 4:29 E-40 C-BO 2Pac Richie Rich 27 (2) Heaven Ain't Hard 2 Find 2Pac 3:58 Extra"""
def extraction2(s):
    # collapse all whitespace runs into single spaces
    s = re.sub(r'\s+', " ", s)
    tracks = []
    trackno = 1
    while 1:
        track = {"trackno": trackno, "title": "", "duration": None}
        # start of next track
        from_ = s.find(str(trackno))
        if from_ < 0:
            # last title has additional artists (not the case in the example)
            tracks[trackno-2]["title"] += " " + s.strip()
        else:
            if trackno > 1 and from_ > 0:
                # add "trailing" artists to previous track
                trailing = s[:from_].strip()
                if trailing:
                    tracks[trackno-2]["title"] += " " + trailing
            # time indicates end of track
            m = re.search(r'\d{1,2}:\d{2}', s[from_:])
            if m:
                line = s[from_:from_+m.end()].split(" ")
                track["title"] = " ".join(line[1:-1]).strip()
                track["duration"] = line[-1]
                tracks.append(track)
        if not track["duration"]:
            break
        # continue after the duration that was just consumed
        s = s[from_+m.end():]
        trackno += 1
    return tracks

tracklist = extraction2(blah)
import json
print(json.dumps(tracklist, indent=4))
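Because the flattened text has no delimiter between title and artist, each "title" value still ends with artist names, and tracks on the second disc keep their "(2)" marker. Below is a minimal sketch of reducing such values to bare titles, assuming you can supply a list of artist names yourself. `KNOWN_ARTISTS` and `clean_title` are hypothetical names and are not part of the answer's code.

```python
import re

# Hypothetical, hand-maintained list of artist names (an assumption).
KNOWN_ARTISTS = ["2Pac", "Snoop Dogg", "Nate Dogg"]

def clean_title(raw, artists=KNOWN_ARTISTS):
    """Drop a leading disc marker such as "(2) " and cut at the first known artist."""
    title = re.sub(r"^\(\d+\)\s*", "", raw)
    cut = len(title)
    for name in artists:
        # require a preceding space so a title starting with "2 of ..." is not cut
        pos = title.find(" " + name)
        if pos != -1:
            cut = min(cut, pos)
    return title[:cut].strip()

# Dicts shaped like the ones extraction2 returns (tracks 1, 9 and 17).
tracks = [
    {"trackno": 1, "title": "Ambitionz Az a Ridah 2Pac", "duration": "4:39"},
    {"trackno": 9, "title": "Life Goes On 2Pac", "duration": "5:02"},
    {"trackno": 17, "title": "(2) Holla at Me 2Pac", "duration": "4:56"},
]
titles = [clean_title(t["title"]) for t in tracks]
print(titles)
```

This only works as well as the artist list is complete; a title followed by an artist that is not in the list is returned unchanged.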
While I'm with @WiktorStribiżew, I like puzzles too ;)
A note on the original code: using the name of the built-in type str as a variable name is poor style.
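A small sketch of why that matters: once a name called str is bound to a string, the built-in type can no longer be called in that scope. The function names here are invented for the illustration.

```python
def shadowed(trackno):
    str = "track"  # the local name now hides the built-in str type
    try:
        return str(trackno)  # tries to call the string "track", not the type
    except TypeError as exc:
        return "TypeError: %s" % exc

def not_shadowed(trackno):
    label = "track"  # a different name leaves the built-in usable
    return label + " " + str(trackno)

print(shadowed(3))
print(not_shadowed(3))
```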