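I have a problem flattening (or collapsing) a DataFrame in which a single row holds several groups of columns describing the same key. I want one row per group instead, each with the same key column and the corresponding data. Suppose the DataFrame looks like this: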
import pandas as pd

df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
                   'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
                   'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
                   'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
                   'START_2': ['1990-02-15', None, '2006-01-01'],
                   'END_2': ['1990-06-14', None, '2006-12-31'],
                   'MEANING_2': ['ELSE', None, 'ANOTHER']})
  CODE     START_1       END_1  MEANING_1     START_2       END_2 MEANING_2
0   AA  1990-01-01  1990-02-14  SOMETHING  1990-02-15  1990-06-14      ELSE
1   BB  2000-01-01  2000-03-01         OR        None        None      None
2   CC  2005-01-01  2005-12-31      OTHER  2006-01-01  2006-12-31   ANOTHER
I need to turn it into this form:
  CODE       START         END    MEANING
0   AA  1990-01-01  1990-02-14  SOMETHING
1   AA  1990-02-15  1990-06-14       ELSE
2   BB  2000-01-01  2000-03-01         OR
3   CC  2005-01-01  2005-12-31      OTHER
4   CC  2006-01-01  2006-12-31    ANOTHER
I have a solution, as follows:
df_a = df[['CODE', 'START_1', 'END_1', 'MEANING_1']]
df_b = df[['CODE', 'START_2', 'END_2', 'MEANING_2']]
df_a = df_a.rename(index=str, columns={'CODE': 'CODE',
'START_1': 'START',
'END_1': 'END',
'MEANING_1': 'MEANING'})
df_b = df_b.rename(index=str, columns={'CODE': 'CODE',
'START_2': 'START',
'END_2': 'END',
'MEANING_2': 'MEANING'})
df = pd.concat([df_a, df_b], ignore_index=True)
df = df.dropna(axis=0, how='any')
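This produces the desired result. Of course, if you have more than 2 groups of columns to collapse (my real code has 6), it does not look very Pythonic and is clearly not ideal. I have looked at the groupby(), melt() and stack() methods, but have not found them very useful so far. Any suggestions would be much appreciated.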
Answer 0 (score: 4)
# suffix is a regular expression matched against the part after the separator
pd.wide_to_long(df, stubnames=['END', 'MEANING', 'START'],
                i='CODE', j='Number', sep='_', suffix=r'\d+')
Output:
                    END    MEANING       START
CODE Number
AA   1       1990-02-14  SOMETHING  1990-01-01
BB   1       2000-03-01         OR  2000-01-01
CC   1       2005-12-31      OTHER  2005-01-01
AA   2       1990-06-14       ELSE  1990-02-15
BB   2             None       None        None
CC   2       2006-12-31    ANOTHER  2006-01-01
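Then, if you like, you can remove the Number column/index level and drop the empty rows, e.g. by calling .reset_index().drop(columns='Number').dropna() on the result. (The older positional form drop('Number', 1) no longer works in recent pandas versions.)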
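For reference, here is a minimal end-to-end sketch of this approach that goes all the way to the layout requested in the question. The names long_df and result are only illustrative, and the sort and column reordering are my own additions to match the expected output; treat it as one possible way to finish the job rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
                   'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
                   'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
                   'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
                   'START_2': ['1990-02-15', None, '2006-01-01'],
                   'END_2': ['1990-06-14', None, '2006-12-31'],
                   'MEANING_2': ['ELSE', None, 'ANOTHER']})

# Reshape: one row per (CODE, group number)
long_df = pd.wide_to_long(df, stubnames=['START', 'END', 'MEANING'],
                          i='CODE', j='Number', sep='_', suffix=r'\d+')

result = (
    long_df.reset_index()
    # drop groups that carry no data at all (BB has no second period)
    .dropna(subset=['START', 'END', 'MEANING'], how='all')
    # keep the periods of each CODE together, in group order
    .sort_values(['CODE', 'Number'])
    .drop(columns='Number')
    .reset_index(drop=True)
)
# Fix the column order explicitly instead of relying on the reshape
result = result[['CODE', 'START', 'END', 'MEANING']]
print(result)
```

Note that wide_to_long requires the values in the i column (CODE here) to be unique in the wide frame.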
Answer 1 (score: 3)
melt will do this:
df1=df.melt('CODE')
df1[['New','New2']]=df1.variable.str.split('_',expand=True)
df1.set_index(['CODE','New2','New']).value.unstack()
Out[492]:
New               END    MEANING       START
CODE New2
AA   1     1990-02-14  SOMETHING  1990-01-01
     2     1990-06-14       ELSE  1990-02-15
BB   1     2000-03-01         OR  2000-01-01
     2           None       None        None
CC   1     2005-12-31      OTHER  2005-01-01
     2     2006-12-31    ANOTHER  2006-01-01
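To get from that indexed frame to the flat table in the question, a few more steps are needed. The sketch below repeats the melt/unstack idea with more descriptive (purely illustrative) names such as field, group and tidy, then drops the empty group and restores the column order; the cleanup steps are my assumption about what the asker wants, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
                   'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
                   'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
                   'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
                   'START_2': ['1990-02-15', None, '2006-01-01'],
                   'END_2': ['1990-06-14', None, '2006-12-31'],
                   'MEANING_2': ['ELSE', None, 'ANOTHER']})

# One row per (CODE, original column name)
long_df = df.melt(id_vars='CODE')
# Split e.g. 'START_1' into the field name 'START' and the group '1'
long_df[['field', 'group']] = long_df['variable'].str.split('_', expand=True)

# Pivot the field names back into columns, one row per (CODE, group)
tidy = long_df.set_index(['CODE', 'group', 'field'])['value'].unstack()
tidy = tidy.dropna(how='all').reset_index().drop(columns='group')
tidy.columns.name = None  # remove the leftover 'field' label
tidy = tidy[['CODE', 'START', 'END', 'MEANING']]
print(tidy)
```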
Answer 2 (score: 0)
Here is one way. It is similar to your logic; I have optimised it slightly and cleaned up the code so that you only need to maintain common_cols, var_cols and data_count.
common_cols = ['CODE']
var_cols = ['START', 'END', 'MEANING']
data_count = 2
dfs = {i: df[common_cols + [k+'_'+str(int(i)) for k in var_cols]].\
rename(columns=lambda x: x.split('_')[0]) for i in range(1, data_count+1)}
pd.concat(list(dfs.values()), ignore_index=True)
#   CODE       START         END    MEANING
# 0   AA  1990-01-01  1990-02-14  SOMETHING
# 1   BB  2000-01-01  2000-03-01         OR
# 2   CC  2005-01-01  2005-12-31      OTHER
# 3   AA  1990-02-15  1990-06-14       ELSE
# 4   BB        None        None       None
# 5   CC  2006-01-01  2006-12-31    ANOTHER
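The output above still contains the empty BB row and is ordered by group rather than by CODE. A possible way to close that gap is to wrap the same logic in a small function; collapse_groups below is a hypothetical helper name of my own, and the dropna and stable sort are additions that assume the asker wants exactly the layout from the question.

```python
import pandas as pd

def collapse_groups(df, common_cols, var_cols, data_count):
    """Stack the numbered column groups of df on top of each other.

    Hypothetical helper: expects columns named '<var>_<i>' for i = 1..data_count.
    """
    pieces = []
    for i in range(1, data_count + 1):
        cols = [f'{name}_{i}' for name in var_cols]
        # select one group and strip the numeric suffix from its columns
        piece = df[common_cols + cols].rename(columns=dict(zip(cols, var_cols)))
        pieces.append(piece)
    stacked = pd.concat(pieces, ignore_index=True)
    # remove groups that hold no data at all
    stacked = stacked.dropna(subset=var_cols, how='all')
    # mergesort is stable, so groups keep their 1..n order within each key
    return stacked.sort_values(common_cols, kind='mergesort').reset_index(drop=True)

df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
                   'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
                   'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
                   'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
                   'START_2': ['1990-02-15', None, '2006-01-01'],
                   'END_2': ['1990-06-14', None, '2006-12-31'],
                   'MEANING_2': ['ELSE', None, 'ANOTHER']})

collapsed = collapse_groups(df, ['CODE'], ['START', 'END', 'MEANING'], 2)
print(collapsed)
```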
Answer 3 (score: 0)
This should work as well.
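df = df.set_index("CODE")
# the following line gets rid of the _x suffix
df.columns = [str(c)[:-2] for c in df.columns]
# each group occupies 3 adjacent columns (START, END, MEANING) in the example
# frame, so take contiguous blocks of 3 rather than every 2nd column
pd.concat([df.iloc[:, 3 * i:3 * (i + 1)] for i in range(2)])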
The way of removing the suffix is taken from "Remove last two characters from column names of all the columns in Dataframe - Pandas".
It should be easy to extend the method to more than 2 groups. Say the OP has 6, each still made up of 3 adjacent columns:
pd.concat([df.iloc[:, 3 * i:3 * (i + 1)] for i in range(6)])
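As a rough sketch of how the group count could be derived instead of hard-coded, the snippet below assumes that every group has the same number of columns and that the columns of each group sit next to each other, as in the question's example frame. The names group_size, n_groups, blocks and stacked are illustrative only, and rsplit is used instead of slicing so that two-digit suffixes would also be stripped.

```python
import pandas as pd

df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
                   'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
                   'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
                   'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
                   'START_2': ['1990-02-15', None, '2006-01-01'],
                   'END_2': ['1990-06-14', None, '2006-12-31'],
                   'MEANING_2': ['ELSE', None, 'ANOTHER']})

group_size = 3  # assumption: START, END, MEANING in every group
wide = df.set_index('CODE')
n_groups = wide.shape[1] // group_size

# Strip the '_<n>' suffix so every block ends up with the same column names
wide.columns = [name.rsplit('_', 1)[0] for name in wide.columns]

# Take each contiguous block of columns and stack the blocks vertically
blocks = [wide.iloc[:, g * group_size:(g + 1) * group_size]
          for g in range(n_groups)]
stacked = (
    pd.concat(blocks)
    .dropna(how='all')             # drop groups without any data
    .sort_index(kind='mergesort')  # stable: keeps group order within a CODE
    .reset_index()
)
print(stacked)
```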
Answer 4 (score: 0)
Here is another way:
df.columns = [i[0] for i in df.columns.str.split('_')]
df = df.T
cond = df.index.duplicated()
concat_df = pd.concat([df[~cond],df[cond]],axis=1).T
sort_df = concat_df.sort_values('START').iloc[:-1]
# rows from the second group have no CODE; fill it from the row above
sort_df['CODE'] = sort_df['CODE'].ffill()