I have data like this:
genre_list
Out[7]:
0 [Action, Adventure, Fantasy, Sci-Fi]
1 [Action, Adventure, Fantasy]
2 [Action, Adventure, Thriller]
3 [Action, Thriller]
4 [Documentary]
5 [Action, Adventure, Sci-Fi]
6 [Action, Adventure, Romance]
7 [Adventure, Animation, Comedy, Family, Fantasy...
8 [Action, Adventure, Sci-Fi]
9 [Adventure, Family, Fantasy, Mystery]
10 [Action, Adventure, Sci-Fi]
11 [Action, Adventure, Sci-Fi]
I wrote the following code to build a DataFrame from these lists of different lengths:
genre_df = pd.DataFrame()
for i in range(len(genre_list)):
    genre_df = genre_df.append(pd.DataFrame(genre_list[i]).T)
and got this:
genre_df.head()
Out[9]:
0 1 2 3 4 5 6 7
0 Action Adventure Fantasy Sci-Fi NaN NaN NaN NaN
0 Action Adventure Fantasy NaN NaN NaN NaN NaN
0 Action Adventure Thriller NaN NaN NaN NaN NaN
0 Action Thriller NaN NaN NaN NaN NaN NaN
0 Documentary NaN NaN NaN NaN NaN NaN NaN
Is there a simpler way to get this DataFrame?
Answer 0 (score: 1)
You can use the DataFrame constructor: convert genre_list to a numpy array with values, then to a list of lists with tolist:
df1 = pd.DataFrame(genre_list.values.tolist(), index=genre_list.index)
print (df1)
0 1 2 3 4
0 Action Adventure Fantasy Sci-Fi None
1 Action Adventure Fantasy None None
2 Action Adventure Thriller None None
3 Action Thriller None None None
4 Documentary None None None None
5 Action Adventure Sci-Fi None None
6 Action Adventure Romance None None
7 Adventure Animation Comedy Family Fantasy
8 Action Adventure Sci-Fi None None
9 Adventure Family Fantasy Mystery None
10 Action Adventure Sci-Fi None None
11 Action Adventure Sci-Fi None None
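To make the constructor approach easy to try in isolation, here is a minimal, self-contained sketch. The three-row Series is made-up sample data standing in for genre_list, not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample standing in for genre_list: lists of different lengths
genre_list = pd.Series([
    ["Action", "Adventure", "Fantasy", "Sci-Fi"],
    ["Action", "Thriller"],
    ["Documentary"],
])

# Each inner list becomes one row; shorter rows are padded on the right
# with missing values so that every row has the same number of columns.
df1 = pd.DataFrame(genre_list.tolist(), index=genre_list.index)

print(df1.shape)  # (3, 4): three rows, as wide as the longest list
```

Passing index=genre_list.index keeps the original row labels, which matters if the Series does not have a default 0..n-1 index.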
If you need to replace None with NaN:
df1 = pd.DataFrame(genre_list.values.tolist(), index=genre_list.index).replace({None:np.nan})
print (df1)
0 1 2 3 4
0 Action Adventure Fantasy Sci-Fi NaN
1 Action Adventure Fantasy NaN NaN
2 Action Adventure Thriller NaN NaN
3 Action Thriller NaN NaN NaN
4 Documentary NaN NaN NaN NaN
5 Action Adventure Sci-Fi NaN NaN
6 Action Adventure Romance NaN NaN
7 Adventure Animation Comedy Family Fantasy
8 Action Adventure Sci-Fi NaN NaN
9 Adventure Family Fantasy Mystery NaN
10 Action Adventure Sci-Fi NaN NaN
11 Action Adventure Sci-Fi NaN NaN
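Whether the padding shows up as None or NaN, pandas counts both as missing, so the replacement is mostly about display and consistency. A small sketch, again on made-up sample data, that counts the genres in each row without caring which marker is present:

```python
import pandas as pd

# Hypothetical sample standing in for genre_list
genre_list = pd.Series([
    ["Action", "Adventure", "Fantasy", "Sci-Fi"],
    ["Action", "Thriller"],
    ["Documentary"],
])
df1 = pd.DataFrame(genre_list.tolist(), index=genre_list.index)

# notna() is False for both None and NaN, so this counts real genres only
genres_per_row = df1.notna().sum(axis=1)

print(genres_per_row.tolist())  # [4, 2, 1]
```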
Another, slower solution is to apply Series to each row:
df1 = genre_list.apply(pd.Series)
0 1 2 3 4
0 Action Adventure Fantasy Sci-Fi NaN
1 Action Adventure Fantasy NaN NaN
2 Action Adventure Thriller NaN NaN
3 Action Thriller NaN NaN NaN
4 Documentary NaN NaN NaN NaN
5 Action Adventure Sci-Fi NaN NaN
6 Action Adventure Romance NaN NaN
7 Adventure Animation Comedy Family Fantasy
8 Action Adventure Sci-Fi NaN NaN
9 Adventure Family Fantasy Mystery NaN
10 Action Adventure Sci-Fi NaN NaN
11 Action Adventure Sci-Fi NaN NaN
Timings:
#[12000 rows]
genre_list = pd.concat([genre_list]*1000).reset_index(drop=True)
In [115]: %timeit pd.DataFrame(genre_list.values.tolist(), index=genre_list.index).replace({None:np.nan})
100 loops, best of 3: 15.7 ms per loop
In [116]: %timeit df1 = genre_list.apply(pd.Series)
1 loop, best of 3: 1.96 s per loop
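The %timeit lines above need IPython. A rough equivalent with the standard timeit module is sketched below, assuming a small made-up Series repeated to 1200 rows. Absolute numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical sample, repeated to get a larger Series (1200 rows)
base = pd.Series([
    ["Action", "Adventure", "Fantasy", "Sci-Fi"],
    ["Action", "Thriller"],
    ["Documentary"],
])
genre_list = pd.concat([base] * 400, ignore_index=True)


def via_constructor():
    # One DataFrame built from a list of lists
    return pd.DataFrame(genre_list.tolist(), index=genre_list.index)


def via_apply():
    # One pd.Series object created per row, then assembled into a frame
    return genre_list.apply(pd.Series)


t_constructor = timeit.timeit(via_constructor, number=3)
t_apply = timeit.timeit(via_apply, number=3)
print(f"constructor: {t_constructor:.4f} s, apply: {t_apply:.4f} s")

# Both approaches should produce the same layout of values and gaps
fast, slow = via_constructor(), via_apply()
```

The apply version builds a separate Series for every row, which is the likely reason it is so much slower in the timings above.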
Answer 1 (score: 1)
A numpy approach (here s is the Series of lists, i.e. genre_list in the question):
lol = s.values.tolist()
lens = [len(l) for l in lol]
i = np.arange(len(lens)).repeat(lens)
j = np.concatenate([np.arange(l) for l in lens])
v = np.concatenate(lol)
pd.Series(v, [i, j]).unstack()
0 1 2 3 4
0 Action Adventure Fantasy Sci-Fi None
1 Action Adventure Fantasy None None
2 Action Adventure Thriller None None
3 Action Thriller None None None
4 Documentary None None None None
5 Action Adventure Sci-Fi None None
6 Action Adventure Romance None None
7 Adventure Animation Comedy Family Fantasy
8 Action Adventure Sci-Fi None None
9 Adventure Family Fantasy Mystery None
10 Action Adventure Sci-Fi None None
11 Action Adventure Sci-Fi None None
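The snippet is terse, so here is the same idea as a commented, self-contained sketch on made-up sample data. It flattens all the lists into one array, labels every value with a (row, position) pair, and lets unstack pivot the positions into columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for s
s = pd.Series([
    ["Action", "Adventure"],
    ["Documentary"],
    ["Action", "Thriller", "Sci-Fi"],
])

lol = s.tolist()                                   # list of lists
lens = [len(l) for l in lol]                       # [2, 1, 3]
i = np.arange(len(lens)).repeat(lens)              # row of each value: [0 0 1 2 2 2]
j = np.concatenate([np.arange(n) for n in lens])   # position in its list: [0 1 0 0 1 2]
v = np.concatenate(lol)                            # all values, flattened

# A Series indexed by (row, position); unstack moves position into columns
wide = pd.Series(v, index=[i, j]).unstack()

print(wide.shape)  # (3, 3)
```

Note that i is built from positions 0..n-1, so the result does not carry over a custom index from s. If the original labels matter, they would need to be reassigned afterwards.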