I have a multi-index pandas dataframe like the one below. This is just one example of the problem I am running into; in reality the dataframe can be very large and contain many occurrences of this issue.

In the first row, index2 has the value 766206, yet 766206 also appears as an index1 value in a later row. That should not be the case, so I need to change that row's index1 to 664627 so that all of those rows sit under index1 664627 (likewise, the last row's index1, 1598742, appears as an index2 under 123456, so it should become 123456).
                 given_name
index1  index2
664627  766206            1
        1297240           1
        1429530           1
569874  396418            1
766206  1429531           1
169874  3697813           1
123456  1598742           1
1598742 19543864          1
The desired output should look like this:
                 given_name
index1  index2
664627  766206            1
        1297240           1
        1429530           1
        1429531           1
569874  396418            1
169874  3697813           1
123456  1598742           1
        19543864          1
Ideally, the solution should be vectorized and fast. I don't have to keep these as indexes: the dataframe could use reset_index() and work with them as columns, then set those columns back as the index afterwards.
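For reference, the example frame above can be rebuilt with a snippet like this (values copied from the tables; the variable name df is just a placeholder):

```python
import pandas as pd

# rebuild the example frame shown in the question
df = pd.DataFrame({
    'index1': [664627, 664627, 664627, 569874, 766206, 169874, 123456, 1598742],
    'index2': [766206, 1297240, 1429530, 396418, 1429531, 3697813, 1598742, 19543864],
    'given_name': [1] * 8,
}).set_index(['index1', 'index2'])
print(df)
```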
Answer 0 (score: 0)
I think you need get_level_values for the first level of the MultiIndex, converted to a Series with to_series, then replace by filling the NaN values created by mask with isin.
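This answer's code sample did not survive the translation, so the following is only a sketch of the described idea (get_level_values, to_series, isin, mask). A plain forward fill would only work if each misplaced row directly follows its parent's group, so this sketch instead fills the masked labels from a child-to-parent mapping; the mapping and the loop are my reconstruction, not necessarily the original code:

```python
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({
    'index1': [664627, 664627, 664627, 569874, 766206, 169874, 123456, 1598742],
    'index2': [766206, 1297240, 1429530, 396418, 1429531, 3697813, 1598742, 19543864],
    'given_name': [1] * 8,
}).set_index(['index1', 'index2'])

lvl0 = pd.Series(df.index.get_level_values('index1'))
lvl1 = pd.Series(df.index.get_level_values('index2'))

# a level-0 label that also occurs in level 1 really belongs under that row's parent
parent = dict(zip(lvl1, lvl0))   # index2 -> the index1 it currently sits under
bad = lvl0.isin(lvl1)            # rows whose index1 is itself somebody's index2
while bad.any():                 # repeat in case parents chain (assumes no cycles)
    lvl0 = lvl0.mask(bad, lvl0.map(parent))
    bad = lvl0.isin(lvl1)

# mask may upcast to float, so cast back before rebuilding the index
lvl0 = lvl0.astype(int)
df.index = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['index1', 'index2'])
df = df.sort_index()
print(df)
```

The loop handles chains such as a -> b -> c: each pass moves a misplaced label one step up, and it stops once no index1 value appears in index2 anymore.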
Answer 1 (score: 0)
I found a workaround:
In [1]: import pandas as pd
   ...: import numpy as np
   ...: from io import StringIO
   ...:
   ...: string1 = """index1,index2,given_name
   ...: 1,2,1
   ...: 1,3,1
   ...: 1,4,1
   ...: 2,5,1
   ...: 6,7,1
   ...: 6,8,1
   ...: 7,9,1
   ...: 9,10,1
   ...: 10,11,1
   ...: 5,12,1
   ...: 12,13,1
   ...: 13,14,1"""
   ...:
   ...: df = pd.read_csv(StringIO(string1), index_col=[0,1])
   ...:
...:
In [2]: df
Out[2]:
               given_name
index1 index2
1      2                1
       3                1
       4                1
2      5                1
6      7                1
       8                1
7      9                1
9      10               1
10     11               1
5      12               1
12     13               1
13     14               1
def find_fixing_rows(df):
df = df.reset_index()
# getting indexes of zeroth and first index
level_zero_indexs = np.unique(df.index1.values)
level_one_indexs = np.unique(df.index2.values)
# finding indexes that appear in both levels, these are ones that need fixing
intersect_index = np.intersect1d(level_zero_indexs, level_one_indexs)
# getting rows that need to be fixed using intersect_index
df_need_fix = df[df.index2.isin(intersect_index)]
return df_need_fix
def combine_missed_matches(df):
#df_need_fix = find_fixing_rows(df)
df = df.reset_index()
# getting indexes of zeroth and first index
level_zero_indexs = np.unique(df.index1.values)
level_one_indexs = np.unique(df.index2.values)
# finding indexes that appear in both levels, these are ones that need fixing
intersect_index = np.intersect1d(level_zero_indexs, level_one_indexs)
# getting rows that need to be fixed using intersect_index
df_need_fix = df[df.index2.isin(intersect_index)]
# joining fixed rows onto original dataframe to allow changing of indexes
df_with_need_fix_join = pd.merge(df,
df_need_fix,
left_on='index1',
right_on='index2',
how='left')
# logic to swap indexs
df_with_need_fix_join['index1_x'] = np.where(df_with_need_fix_join.index1_y.notnull(),
df_with_need_fix_join.index1_y,
df_with_need_fix_join.index1_x)
# dropping columns, renaming and tidying
df_with_need_fix_join = df_with_need_fix_join.drop(['index1_y',
'index2_y',
'given_name_y'],
axis=1)
df_with_need_fix_join = df_with_need_fix_join.rename(columns={
'index1_x' : 'index1',
'index2_x' : 'index2',
'given_name_x' : 'given_name'
})
    # np.where with NaNs upcasts to float, so cast index1 back to int
    df_with_need_fix_join.index1 = df_with_need_fix_join.index1.astype(int)
df_with_need_fix_join = df_with_need_fix_join.set_index(['index1', 'index2'])
return df_with_need_fix_join
def fix_missing_matches(df, condition=True):
while condition:
condition = find_fixing_rows(df).shape[0] > 0
df = combine_missed_matches(df)
df = df.sort_index()
return df
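To see what the intersection step inside find_fixing_rows flags on the In [1] sample data, it can be run in isolation (arrays copied from string1):

```python
import numpy as np

# index1/index2 columns from the In [1] sample data
index1 = np.array([1, 1, 1, 2, 6, 6, 7, 9, 10, 5, 12, 13])
index2 = np.array([2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14])

# labels appearing on both levels: a "child" label reused as a parent
intersect_index = np.intersect1d(np.unique(index1), np.unique(index2))
print(intersect_index)  # the labels 2, 5, 7, 9, 10, 12, 13
```

The rows whose index2 falls in this set are the ones find_fixing_rows returns; in the later merge they supply the replacement parent for any row whose index1 matches their index2.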
In [4]: df_fix = fix_missing_matches(df)
...:
...: df_fix
...:
Out[4]:
               given_name
index1 index2
1      2                1
       3                1
       4                1
       5                1
       12               1
       13               1
       14               1
6      7                1
       8                1
       9                1
       10               1
       11               1