我正在尝试为数据框上的以下转换获取合理的解决方案:
鉴于此数据框:
产生:
即将多索引的分组级别填充到标准长度(行数)
在某种较大的多索引数据帧(约几千列和约几百万行)上,是否有一种合理的快速方法?
以下是给定的数据框字典,以供快速参考:
d = {'region': {0: 'intro',
1: 'intro',
2: 'intro',
3: 'mid',
4: 'mid',
5: 'start',
6: 'start',
7: 'start',
8: 'title',
9: 'title'},
'feat_index': {0: 9, 1: 3, 2: 0, 3: 7, 4: 8, 5: 2, 6: 4, 7: 1, 8: 6, 9: 5},
'position_in_region': {0: 422,
1: 5834,
2: 8813,
3: 3187,
4: 9407,
5: 997,
6: 3154,
7: 8416,
8: 5408,
9: 8421},
'document_0': {0: 0.39,
1: 0.79,
2: 0.01,
3: 0.55,
4: 0.99,
5: 0.67,
6: 0.61,
7: 0.84,
8: 0.15,
9: 0.23},
'document_1': {0: 0.8,
1: 0.06,
2: 0.92,
3: 0.74,
4: 0.06,
5: 0.96,
6: 0.57,
7: 0.19,
8: 0.29,
9: 0.24},
'document_2': {0: 0.81,
1: 0.15,
2: 0.19,
3: 0.17,
4: 0.11,
5: 0.34,
6: 0.8,
7: 0.03,
8: 0.67,
9: 0.46}}
df = pd.DataFrame(d).set_index(['region', 'feat_index', 'position_in_region'])
答案 0 :(得分:2)
您可以将merge
与numpy.repeat
和numpy.tile
创建的帮助器DataFrame左联接一起使用:
#get number of new rows by Counter.most_common(1)
from collections import Counter
no_vals = Counter(df.index.labels[0]).most_common(1)[0][1]
print(no_vals)
3
df1 = pd.DataFrame({'region':np.repeat(df.index.levels[0], no_vals),
'id': np.tile(np.arange(no_vals), len(np.unique(df.index.labels[0])))})
print (df1)
region id
0 intro 0
1 intro 1
2 intro 2
3 mid 0
4 mid 1
5 mid 2
6 start 0
7 start 1
8 start 2
9 title 0
10 title 1
11 title 2
#MultiIndex to columns
df = df.reset_index()
#new could with counter of regions
df.insert(1, 'id', df.groupby('region').cumcount())
#merge, remove helper id columns and create MultiIndex
df = (df1.merge(df, how='left')
.drop('id', 1)
.set_index(['region', 'feat_index', 'position_in_region']))
print (df)
document_0 document_1 document_2
region feat_index position_in_region
intro 9.0 422.0 0.39 0.80 0.81
3.0 5834.0 0.79 0.06 0.15
0.0 8813.0 0.01 0.92 0.19
mid 7.0 3187.0 0.55 0.74 0.17
8.0 9407.0 0.99 0.06 0.11
NaN NaN NaN NaN NaN
start 2.0 997.0 0.67 0.96 0.34
4.0 3154.0 0.61 0.57 0.80
1.0 8416.0 0.84 0.19 0.03
title 6.0 5408.0 0.15 0.29 0.67
5.0 8421.0 0.23 0.24 0.46
NaN NaN NaN NaN NaN
使用DataFrame.reindex
和MultiIndex.from_product
的另一种解决方案:
from collections import Counter
no_vals = Counter(df.index.labels[0]).most_common(1)[0][1]
print(no_vals)
3
mux = pd.MultiIndex.from_product([df.index.levels[0],
np.arange(no_vals)], names=['region','id'])
print (mux)
MultiIndex(levels=[['intro', 'mid', 'start', 'title'], [0, 1, 2]],
codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
names=['region', 'id'])
df = df.reset_index(level=[1, 2]).set_index(df.groupby(level=0).cumcount(), append=True)
df = (df.reindex(mux).reset_index(level=1, drop=True)
.set_index(['feat_index', 'position_in_region'], append=True))
print (df)
document_0 document_1 document_2
region feat_index position_in_region
intro 9.0 422.0 0.39 0.80 0.81
3.0 5834.0 0.79 0.06 0.15
0.0 8813.0 0.01 0.92 0.19
mid 7.0 3187.0 0.55 0.74 0.17
8.0 9407.0 0.99 0.06 0.11
NaN NaN NaN NaN NaN
start 2.0 997.0 0.67 0.96 0.34
4.0 3154.0 0.61 0.57 0.80
1.0 8416.0 0.84 0.19 0.03
title 6.0 5408.0 0.15 0.29 0.67
5.0 8421.0 0.23 0.24 0.46
NaN NaN NaN NaN NaN