We have the following data:
Name genres
A Action|Adventure|Science Fiction|Thriller
B Action|Adventure|Science Fiction|Thriller
C Adventure|Science Fiction|Thriller
I want to reshape it so that my DataFrame looks like this:
Name genres
A Action
A Adventure
A Science Fiction
A Thriller
B Action
B Adventure
B Science Fiction
B Thriller
C Adventure
C Science Fiction
C Thriller
Here is my code:
gen = df1[df1['genres'].str.contains('|')]
gen1 = gen.copy()
gen2 = gen.copy()
gen3 = gen.copy()
gen4 = gen.copy()
gen1['genres'] = gen1['genres'].apply(lambda x: x.split("|")[0])
gen2['genres'] = gen2['genres'].apply(lambda x: x.split("|")[1])
gen3['genres'] = gen3['genres'].apply(lambda x: x.split("|")[2])
gen4['genres'] = gen4['genres'].apply(lambda x: x.split("|")[3])
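I get this error:
IndexError: list index out of range
Answer 0 (score: 1)
The IndexError occurs because row C has only three genres, so x.split("|")[3] does not exist for that row.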
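Instead, create lists of genres with split, repeat each Name by the list lengths from str.len using repeat, and finally flatten the lists with chain.from_iterable: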
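import pandas as pd
from itertools import chain
# Split each genres string into a list of genres
genres = df['genres'].str.split('|')
df = pd.DataFrame({
    # Repeat each Name once per genre in its row
    'Name': df['Name'].values.repeat(genres.str.len()),
    # Flatten the per-row lists into one long list
    'genres': list(chain.from_iterable(genres.tolist()))
})
print(df)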
Name genres
0 A Action
1 A Adventure
2 A Science Fiction
3 A Thriller
4 B Action
5 B Adventure
6 B Science Fiction
7 B Thriller
8 C Adventure
9 C Science Fiction
10 C Thriller
Edit:
Solution for a dynamic number of columns:
print (df)
Name genres col
0 A Action|Adventure|Science Fiction|Thriller 2
1 B Action|Adventure|Science Fiction|Thriller 3
2 C Adventure|Science Fiction|Thriller 5
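from itertools import chain
# All columns except genres (difference() returns them sorted)
cols = df.columns.difference(['genres'])
genres = df['genres'].str.split('|')
# Repeat each row once per genre, then attach the flattened genre list
df = (df.loc[df.index.repeat(genres.str.len()), cols]
        .assign(genres=list(chain.from_iterable(genres.tolist()))))
print(df)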
Name col genres
0 A 2 Action
0 A 2 Adventure
0 A 2 Science Fiction
0 A 2 Thriller
1 B 3 Action
1 B 3 Adventure
1 B 3 Science Fiction
1 B 3 Thriller
2 C 5 Adventure
2 C 5 Science Fiction
2 C 5 Thriller
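As a possible alternative to the repeat/chain approach above, newer pandas versions (0.25 and later) provide DataFrame.explode, which expands list-valued cells into rows and carries the other columns along. The sketch below is not part of the original answer; it rebuilds the sample frame from the question (including the extra col column) and uses the name out for the result, both chosen here for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question, with the extra "col" column
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "genres": [
        "Action|Adventure|Science Fiction|Thriller",
        "Action|Adventure|Science Fiction|Thriller",
        "Adventure|Science Fiction|Thriller",
    ],
    "col": [2, 3, 5],
})

out = (
    # Turn each genres string into a list of genres
    df.assign(genres=df["genres"].str.split("|"))
      # One row per list element; other columns are repeated
      .explode("genres")
      # Replace the repeated original index with 0..n-1
      .reset_index(drop=True)
)
print(out)
```

Unlike the columns.difference version, this keeps the original column order, and rows with fewer genres simply produce fewer output rows, so no fixed index such as [3] is ever needed.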