I have a DataFrame in a very odd format:
id Code Week1 Week2 week3
sunday nan nan nan nan
id Code Week1 Week2 week3
1 100 y y n
2 200 n y n
3 300 n n y
Monday nan nan nan nan
id Code Week1 Week2 week3
1 500 n y y
2 600 y y y
Tuesday nan nan nan nan
id Code Week1 Week2 week3
1 800 n y y
2 900 y n y
I would like to get it into this format:
Code Day Week
100 Sunday 1
600 Monday 1
900 Tuesday 1
100 Sunday 2
200 Sunday 2
500 Monday 2
600 Monday 2
800 Tuesday 2
300 Sunday 3
500 Monday 3
600 Monday 3
800 Tuesday 3
900 Tuesday 3
That is, if a code has the value y in a given week, that code is visited in that week.
Is there a way to do this in pandas?
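To try the answers below, a frame like the one in the question can be rebuilt by hand. This is only a sketch: it assumes every cell is a string, that the `nan` entries are real missing values, and that the first header line is the column index while the repeated header lines are ordinary rows. The question does not say how the frame was actually loaded.

```python
import numpy as np
import pandas as pd

# Column labels as shown in the question (note the lowercase "week3").
columns = ['id', 'Code', 'Week1', 'Week2', 'week3']
header = ['id', 'Code', 'Week1', 'Week2', 'week3']  # repeated header row
nan = np.nan

rows = [
    ['sunday', nan, nan, nan, nan],   # day marker row
    header,
    ['1', '100', 'y', 'y', 'n'],
    ['2', '200', 'n', 'y', 'n'],
    ['3', '300', 'n', 'n', 'y'],
    ['Monday', nan, nan, nan, nan],
    header,
    ['1', '500', 'n', 'y', 'y'],
    ['2', '600', 'y', 'y', 'y'],
    ['Tuesday', nan, nan, nan, nan],
    header,
    ['1', '800', 'n', 'y', 'y'],
    ['2', '900', 'y', 'n', 'y'],
]

# Assumption: all values are strings, missing cells are NaN.
df = pd.DataFrame(rows, columns=columns)
print(df)
```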
Answer 0 (score: 3)
Not my best work... but I don't want to try anymore... it hurts my soul.
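import re
import numpy as np
# Drop repeated header rows, blank out numeric ids, then forward-fill every column
d = df.query('id != "id"').replace(dict(id={r'\d+': None}), regex=True).ffill()
# duplicated('id') is False on each day's marker row, so this keeps only data rows;
# dropna() covers pandas versions whose stack() keeps NaN
s = d[d.duplicated('id')].set_index(['id', 'Code']).replace({'y': 1, 'n': np.nan}).stack().dropna()
s.rename_axis(['Day', 'Code', 'Week']).reset_index('Week').Week.str.replace(
    'week', '', flags=re.IGNORECASE, regex=True
).reset_index()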
Day Code Week
0 sunday 100 1
1 sunday 100 2
2 sunday 200 2
3 sunday 300 3
4 Monday 500 2
5 Monday 500 3
6 Monday 600 1
7 Monday 600 2
8 Monday 600 3
9 Tuesday 800 2
10 Tuesday 800 3
11 Tuesday 900 1
12 Tuesday 900 3
Answer 1 (score: 1)
You can use:
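import numpy as np
# Use the day names (rows where Code is missing) as the index, forward-filled
df.index = df['id'].where(df['Code'].isnull()).ffill()
# Drop the repeated header rows and the day marker rows
df = df[(df['Code'] != 'Code') & (df['id'] != df.index)]
df = df.rename_axis('Day').rename_axis('Week', axis=1)
# Turn n into NaN so that only the y cells remain after stacking
df = (df.set_index(['id','Code'], append=True)
        .replace({'n':np.nan})
        .stack().dropna().reset_index(name='val'))
df['Week'] = df['Week'].str.extract(r'(\d+)', expand=False).astype(int)
cols = ['Code','Day','Week']
df = df.drop(['val','id'], axis=1)[cols].sort_values(['Week','Code']).reset_index(drop=True)
print(df)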
Code Day Week
0 100 sunday 1
1 600 Monday 1
2 900 Tuesday 1
3 100 sunday 2
4 200 sunday 2
5 500 Monday 2
6 600 Monday 2
7 800 Tuesday 2
8 300 sunday 3
9 500 Monday 3
10 600 Monday 3
11 800 Tuesday 3
12 900 Tuesday 3
For a general output that keeps the id column and all y and n values, remove the replace call:
df.index = df['id'].where(df['Code'].isnull()).ffill()
df = df[(df['Code'] != 'Code') & (df['id'] != df.index)]
df = df.rename_axis('Day').rename_axis('Week', axis=1)
df = df.set_index(['id','Code'], append=True).stack().reset_index(name='val')
df['Week'] = df['Week'].str.extract(r'(\d+)', expand=False).astype(int)
print(df)
Day id Code Week val
0 sunday 1 100 1 y
1 sunday 1 100 2 y
2 sunday 1 100 3 n
3 sunday 2 200 1 n
4 sunday 2 200 2 y
5 sunday 2 200 3 n
6 sunday 3 300 1 n
7 sunday 3 300 2 n
8 sunday 3 300 3 y
9 Monday 1 500 1 n
10 Monday 1 500 2 y
11 Monday 1 500 3 y
12 Monday 2 600 1 y
13 Monday 2 600 2 y
14 Monday 2 600 3 y
15 Tuesday 1 800 1 n
16 Tuesday 1 800 2 y
17 Tuesday 1 800 3 y
18 Tuesday 2 900 1 y
19 Tuesday 2 900 2 n
20 Tuesday 2 900 3 y
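From a long frame like this, the visited codes can be recovered by keeping only the rows whose val is y. The sketch below is an illustration rather than part of the answer: `long_df` and `visited` are made-up names, and the frame is a hand-copied subset of the output above so that the snippet runs on its own.

```python
import pandas as pd

# Hand-copied subset of the general output (codes 100 and 600 only).
long_df = pd.DataFrame({
    'Day':  ['sunday', 'sunday', 'sunday', 'Monday', 'Monday', 'Monday'],
    'id':   ['1', '1', '1', '2', '2', '2'],
    'Code': ['100', '100', '100', '600', '600', '600'],
    'Week': [1, 2, 3, 1, 2, 3],
    'val':  ['y', 'y', 'n', 'y', 'y', 'y'],
})

# Keep only the visited (y) rows, then order them by week and code.
visited = (long_df.loc[long_df['val'] == 'y', ['Code', 'Day', 'Week']]
           .sort_values(['Week', 'Code'])
           .reset_index(drop=True))
print(visited)
```

Filtering at the end, instead of replacing n with NaN before stacking, keeps the intermediate frame available for other summaries.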
Answer 2 (score: 0)
Building on @piRsquared's answer, for those who want a pseudo one-liner:
In [2689]: (df.query('id != "id"').replace(dict(id={r'\d+': np.nan}), regex=True)
              .assign(id=lambda x: x['id'].ffill()).dropna()
              .set_index(['id', 'Code'])
              .replace({'y': 1, 'n': np.nan})
              .rename(columns=lambda x: x.lower().replace('week', ''))
              .stack().dropna()
              .reset_index()
              .rename(columns={'id': 'Day', 'level_2': 'Week'})
              .drop(columns=0))
Out[2689]:
Day Code Week
0 sunday 100 1
1 sunday 100 2
2 sunday 200 2
3 sunday 300 3
4 Monday 500 2
5 Monday 500 3
6 Monday 600 1
7 Monday 600 2
8 Monday 600 3
9 Tuesday 800 2
10 Tuesday 800 3
11 Tuesday 900 1
12 Tuesday 900 3
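For comparison, here is a sketch of the same reshaping with `DataFrame.melt` instead of `stack`. It is not taken from any of the answers above. It assumes string cells with real NaN values in the day marker rows, uses a trimmed two-day copy of the question's data, and the names `day`, `is_data`, `data`, `long_df` and `result` are made up for the example.

```python
import numpy as np
import pandas as pd

columns = ['id', 'Code', 'Week1', 'Week2', 'week3']
nan = np.nan
# Trimmed sample: two days only, same layout as in the question.
df = pd.DataFrame([
    ['sunday', nan, nan, nan, nan],
    columns,
    ['1', '100', 'y', 'y', 'n'],
    ['2', '200', 'n', 'y', 'n'],
    ['Monday', nan, nan, nan, nan],
    columns,
    ['1', '500', 'n', 'y', 'y'],
], columns=columns)

# A row is a day marker when its Code is missing; carry that label downward.
day = df['id'].where(df['Code'].isna()).ffill()
# Real data rows are neither day markers nor repeated header rows.
is_data = df['Code'].notna() & (df['Code'] != 'Code')
data = df.loc[is_data].assign(Day=day[is_data])

# One row per (Day, Code, week column), with the y/n flag in "val".
long_df = data.melt(id_vars=['Day', 'Code'],
                    value_vars=['Week1', 'Week2', 'week3'],
                    var_name='Week', value_name='val')
# Pull the week number out of labels such as "Week1" or "week3".
long_df['Week'] = long_df['Week'].str.extract(r'(\d+)', expand=False).astype(int)

result = (long_df.loc[long_df['val'] == 'y', ['Code', 'Day', 'Week']]
          .sort_values(['Week', 'Code'])
          .reset_index(drop=True))
print(result)
```

Because `melt` never drops rows on its own, the y filter is explicit here and does not depend on how a given pandas version treats NaN when stacking.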