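I am reading multiple JSON objects into a single DataFrame. The problem is that some of the columns hold lists. The data is also very large, so I cannot use the solutions available on the internet: they are very slow and memory-inefficient.
Here is what my data looks like: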
df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
A B C D E
0 x1 [v1, v2] [c1, c2] [d1, d2] [e1, e2]
1 x2 [v3, v4] [c3, c4] [d3, d4] [e3, e4]
2 x3 [v5, v6] [c5, c6] [d5, d6] [e5, e6]
3 x4 [v7, v8] [c7, c8] [d7, d8] [e7, e8]
The shape of my data is (441079, 12).
My desired output is:
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
.....
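Edit: After this was marked as a duplicate, I want to stress that this question asks for an efficient way to explode multiple columns. The accepted answer can efficiently explode an arbitrary number of columns on very large datasets, which the answers to the other question could not do (that is why I asked this question after testing those solutions).
Answer 0: (score: 8)
Use set_index on A, then apply and stack on the remaining columns, all condensed into a single one-liner.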
In [1253]: (df.set_index('A')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop(columns='level_1'))
Out[1253]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
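To see what the one-liner does to each list column, here is a minimal sketch on a single column. The names s, wide, long_form and flat are made up for illustration. apply(pd.Series) spreads each list across columns, stack() folds those columns back into rows under a two-level index, and reset_index() turns the unnamed second level into the level_1 column that the one-liner then drops.

```python
import pandas as pd

# One list column, indexed by the column that should be repeated
s = pd.Series([['v1', 'v2'], ['v3', 'v4']],
              index=pd.Index(['x1', 'x2'], name='A'), name='B')

# Each list becomes a row; list positions become columns 0 and 1
wide = s.apply(pd.Series)

# Fold the position columns back into rows: index is now (A, position)
long_form = wide.stack()

# The unnamed position level is materialised as a column called 'level_1'
flat = long_form.rename('B').reset_index()
print(flat)
```

The intermediate wide frame is built by calling pd.Series once per row, which is likely where most of the time goes on large inputs.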
Answer 1: (score: 5)
import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists (taken from the first list column)
    lens = df[lst_cols[0]].str.len()
    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty:
        # re-attach those rows and fill their list columns with `fill_value`
        return pd.concat([
            pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols
            }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}),
            df.loc[lens == 0, idx_cols]
        ]).fillna(fill_value) \
          .loc[:, df.columns]
Usage:
In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
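The core of the function above is two NumPy calls. A minimal sketch on a made-up two-row frame (df_small and the other variable names are illustrative only): np.repeat copies each scalar value once per list element, and np.concatenate flattens the lists end to end, so the two arrays line up row for row.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': ['x1', 'x2'],
                         'B': [['v1', 'v2'], ['v3', 'v4', 'v5']]})

# Length of the list in each row
lens = df_small['B'].str.len()

# Repeat each scalar in A as many times as its row's list is long
repeated_a = np.repeat(df_small['A'].values, lens)

# Flatten all lists in B into one 1-D array
flat_b = np.concatenate(df_small['B'].values)

out = pd.DataFrame({'A': repeated_a, 'B': flat_b})
print(out)
```

Both calls work on whole arrays at once rather than row by row, which is presumably why this approach copes better with large frames.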
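Answer 2: (score: 2)
Assuming the lists in each row have the same length in every column, you can call Series.explode (available from pandas 0.25) on each column.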
df.set_index(['A']).apply(pd.Series.explode).reset_index()
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
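The idea is to first set as the index every column that must not be exploded, and then reset the index afterwards.
It is also faster than the apply(pd.Series).stack() approach, as the timings below show.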
%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(df.set_index('A')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop(columns='level_1'))
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
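Since this approach relies on the lists in each row having matching lengths, it may be worth checking that before exploding. A minimal sketch, assuming every cell in the listed columns is a list; the helper name list_lengths_match is hypothetical and not part of pandas.

```python
import pandas as pd

def list_lengths_match(df, lst_cols):
    """Return True if, in every row, all columns in lst_cols hold lists of equal length."""
    # Per-cell list lengths, one column per list column
    lens = df[lst_cols].apply(lambda col: col.str.len())
    # Compare every column of lengths against the first one, row by row
    return bool(lens.eq(lens.iloc[:, 0], axis=0).all().all())

ok = pd.DataFrame({'A': ['x1'], 'B': [['v1', 'v2']], 'C': [['c1', 'c2']]})
bad = pd.DataFrame({'A': ['x1'], 'B': [['v1', 'v2']], 'C': [['c1']]})

print(list_lengths_match(ok, ['B', 'C']))
print(list_lengths_match(bad, ['B', 'C']))
```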
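Answer 3: (score: 2)
Building on @cs95's answer, we can use an if clause inside the lambda function instead of setting all the other columns as the index. This has the advantage that columns are easy to specify: use x.name in [...] for the columns to explode, or x.name not in [...] for the columns to leave unmodified.
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)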
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
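As an alternative that none of the answers here use: to my knowledge, pandas 1.3 and later let DataFrame.explode take a list of columns directly, which avoids both the index juggling and the lambda. A minimal sketch on a cut-down version of the question's data, assuming a recent enough pandas and lists of matching length in each row:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x1', 'x2'],
                   'B': [['v1', 'v2'], ['v3', 'v4']],
                   'C': [['c1', 'c2'], ['c3', 'c4']]})

# Explode B and C together; A is repeated for each list element.
# reset_index(drop=True) replaces the repeated 0, 0, 1, 1 index with 0..3.
out = df.explode(['B', 'C']).reset_index(drop=True)
print(out)
```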
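Answer 4: (score: 0)
Here is my solution using the apply function. Main features/differences:
Note: the "trim" option was developed for my own needs and is beyond the scope of this question.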
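import pandas as pd

# NOTE: `lenx` is not shown in the original post. This definition is an assumption
# inferred from the outputs below: list length for lists, 1 for scalars.
def lenx(x):
    return len(x) if isinstance(x, list) else 1

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    # positions of the columns named in `cols`; default to all columns
    jcols = [j for j, v in enumerate(row.index) if v in cols]
    if len(jcols) < 1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    # only rework the row if its cells differ in length
    if not Ls[:-1] == Ls[1:]:
        vals = [v if isinstance(v, list) else [v] for v in row.values]
        if fill_mode == 'external':
            # pad with `fill_value`; lists outside `cols` are kept whole in the first slot
            vals = [[e] + [fill_value]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [fill_value]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        elif fill_mode == 'internal':
            # pad by repeating the last element (or the whole list outside `cols`)
            vals = [[e] + [e]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [e[-1]]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        else:
            # 'trim': cut every cell down to the shortest length
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals, index=row.index.tolist())
    return row
Example: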
df=pd.DataFrame({
'a':[[1],2,3],
'b':[[4,5,7],[5,4],4],
'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)
Output:
a b c
0 [1] [4, 5, 7] [4, 5]
1 2 [5, 4] 5
2 3 4 [6]
fill_mode='external', all columns, fill_value = 'OK'
a b c
0 1 4 4
0 OK 5 5
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
a b c
0 1 4 [4, 5]
0 OK 5 OK
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='internal', cols = ['a', 'b']
a b c
0 1 4 [4, 5]
0 1 5 [4, 5]
0 1 7 [4, 5]
1 2 5 5
1 2 4 5
2 3 4 6
fill_mode='trim', all columns
a b c
0 1 4 4
1 2 5 5
2 3 4 6