Take the following toy DataFrame:
import numpy as np
import pandas as pd
data = np.arange(35, dtype=np.float32).reshape(7, 5)
data = pd.concat((
    pd.DataFrame(list('abcdefg'), columns=['field1']),
    pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])),
    axis=1)
data.iloc[1:4, 4:] = np.nan
data.iloc[4, 3:] = np.nan
print(data)
  field1  field2  2014  2015  2016  2017
0      a     0.0   1.0   2.0   3.0   4.0
1      b     5.0   6.0   7.0   NaN   NaN
2      c    10.0  11.0  12.0   NaN   NaN
3      d    15.0  16.0  17.0   NaN   NaN
4      e    20.0  21.0   NaN   NaN   NaN
5      f    25.0  26.0  27.0  28.0  29.0
6      g    30.0  31.0  32.0  33.0  34.0
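I want to replace the "year" columns (2014-2017) with two fields: the most recent non-null observation, and the year that observation belongs to. Assume field1 is a unique key. (I'm not looking to do any groupby operations; there is just 1 row per record.) I.e.: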
  field1  field2   obs  date
0      a     0.0   4.0  2017
1      b     5.0   7.0  2015
2      c    10.0  12.0  2015
3      d    15.0  17.0  2015
4      e    20.0  21.0  2014
5      f    25.0  29.0  2017
6      g    30.0  34.0  2017
This is as far as I've gotten:
pd.melt(data, id_vars=['field1', 'field2'],
        value_vars=['2014', '2015', '2016', '2017'])\
    .dropna(subset=['value'])
  field1  field2 variable  value
0      a     0.0     2014    1.0
1      b     5.0     2014    6.0
2      c    10.0     2014   11.0
3      d    15.0     2014   16.0
4      e    20.0     2014   21.0
5      f    25.0     2014   26.0
6      g    30.0     2014   31.0
# ...
But I'm struggling with how to pivot this back to the desired format.
Answer 0 (score: 4)
Maybe:
d2 = data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs").dropna(subset=["obs"])
d2["date"] = d2["date"].astype(int)
df = d2.loc[d2.groupby(["field1", "field2"])["date"].idxmax()]
gives me
   field1  field2  date   obs
21      a     0.0  2017   4.0
8       b     5.0  2015   7.0
9       c    10.0  2015  12.0
10      d    15.0  2015  17.0
4       e    20.0  2014  21.0
26      f    25.0  2017  29.0
27      g    30.0  2017  34.0
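For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same melt + idxmax idea. The helper `make_toy` and the names `long` and `latest` are my own, not from the question. It groups on field1 only (the question says it is a unique key) and reorders the columns to match the requested layout.

```python
import numpy as np
import pandas as pd


def make_toy():
    """Rebuild the toy frame from the question (hypothetical helper)."""
    values = np.arange(35, dtype=np.float32).reshape(7, 5)
    df = pd.concat(
        (
            pd.DataFrame(list("abcdefg"), columns=["field1"]),
            pd.DataFrame(values, columns=["field2", "2014", "2015", "2016", "2017"]),
        ),
        axis=1,
    )
    df.iloc[1:4, 4:] = np.nan
    df.iloc[4, 3:] = np.nan
    return df


data = make_toy()

# Long format: one row per (record, year); drop the missing observations.
long = data.melt(
    id_vars=["field1", "field2"], var_name="date", value_name="obs"
).dropna(subset=["obs"])
long["date"] = long["date"].astype(int)

# For each record, keep the row with the largest year, then tidy up.
latest = (
    long.loc[long.groupby("field1")["date"].idxmax()]
    .sort_values("field1")
    .reset_index(drop=True)[["field1", "field2", "obs", "date"]]
)
print(latest)
```

Casting date to int matters here because idxmax needs a numeric (or otherwise orderable) column to find the latest year.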
Answer 1 (score: 3)
How about the following approach:
In [160]: df
Out[160]:
  field1  field2  2014  2015  2016   2017
0      a     0.0   1.0   2.0   3.0  -10.0
1      b     5.0   6.0   7.0   NaN    NaN
2      c    10.0  11.0  12.0   NaN    NaN
3      d    15.0  16.0  17.0   NaN    NaN
4      e    20.0  21.0   NaN   NaN    NaN
5      f    25.0  26.0  27.0  28.0   29.0
6      g    30.0  31.0  32.0  33.0   34.0
In [180]: df.groupby(lambda x: 'obs' if x.isdigit() else x, axis=1) \
     ...:   .last() \
     ...:   .assign(date=df.filter(regex=r'^\d{4}').loc[:, ::-1].notnull().idxmax(1))
Out[180]:
  field1  field2    obs  date
0      a     0.0  -10.0  2017
1      b     5.0    7.0  2015
2      c    10.0   12.0  2015
3      d    15.0   17.0  2015
4      e    20.0   21.0  2014
5      f    25.0   29.0  2017
6      g    30.0   34.0  2017
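As far as I know, newer pandas releases deprecate axis=1 in groupby, so here is a hedged sketch of the same idea without it. This swaps in a different technique (a row-wise forward fill for the value, and a reversed not-null mask for the year), and the names `make_toy`, `year_cols`, `has_obs` and `result` are mine. It assumes the year columns are exactly the ones with four-digit names.

```python
import numpy as np
import pandas as pd


def make_toy():
    """Rebuild the toy frame from the question (hypothetical helper)."""
    values = np.arange(35, dtype=np.float32).reshape(7, 5)
    df = pd.concat(
        (
            pd.DataFrame(list("abcdefg"), columns=["field1"]),
            pd.DataFrame(values, columns=["field2", "2014", "2015", "2016", "2017"]),
        ),
        axis=1,
    )
    df.iloc[1:4, 4:] = np.nan
    df.iloc[4, 3:] = np.nan
    return df


data = make_toy()
# Same tweak as above: make the latest value for 'a' differ from the row maximum.
data.loc[0, "2017"] = -10.0

# Assumption: year columns are the ones whose name is exactly four digits.
year_cols = data.filter(regex=r"^\d{4}$")
has_obs = year_cols.notna()

# Forward-fill along each row, so the last column holds the latest observation.
obs = year_cols.ffill(axis=1).iloc[:, -1]

# Reverse the column order so idxmax finds the latest non-null year first.
date = has_obs.iloc[:, ::-1].idxmax(axis=1)
# Rows with no observation at all get NaN instead of a misleading year.
date = date.where(has_obs.any(axis=1))

result = data[["field1", "field2"]].assign(obs=obs, date=date)
print(result)
```

Note that date stays a string column here, since it is taken from the column labels; cast it with astype(int) if you need numeric years and every row has an observation.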
Answer 2 (score: 2)
Using last_valid_index + agg('last'):
A = data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), axis=1)
B = data.groupby(['value'] * data.shape[1], axis=1).agg('last')
data['date'] = A
data['obs'] = B
data
Out[1326]:
  field1  field2  2014  2015  2016  2017  date   obs
0      a     0.0   1.0   2.0   3.0   4.0  2017   4.0
1      b     5.0   6.0   7.0   NaN   NaN  2015   7.0
2      c    10.0  11.0  12.0   NaN   NaN  2015  12.0
3      d    15.0  16.0  17.0   NaN   NaN  2015  17.0
4      e    20.0  21.0   NaN   NaN   NaN  2014  21.0
5      f    25.0  26.0  27.0  28.0  29.0  2017  29.0
6      g    30.0  31.0  32.0  33.0  34.0  2017  34.0
By using assign, we can do both in a single statement, as below:
data.assign(
    date=data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), axis=1),
    obs=data.groupby(['value'] * data.shape[1], axis=1).agg('last'))
Out[1340]:
  field1  field2  2014  2015  2016  2017  date   obs
0      a     0.0   1.0   2.0   3.0   4.0  2017   4.0
1      b     5.0   6.0   7.0   NaN   NaN  2015   7.0
2      c    10.0  11.0  12.0   NaN   NaN  2015  12.0
3      d    15.0  16.0  17.0   NaN   NaN  2015  17.0
4      e    20.0  21.0   NaN   NaN   NaN  2014  21.0
5      f    25.0  26.0  27.0  28.0  29.0  2017  29.0
6      g    30.0  31.0  32.0  33.0  34.0  2017  34.0
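A small sketch of the last_valid_index half on its own, in case it helps to see it in isolation. Here the observation is looked up at each row's last valid year rather than taken from a column-wise groupby; that lookup is my substitution, not the code above. The names `make_toy`, `years` and `result` are hypothetical, and the year columns are listed explicitly rather than sliced by position.

```python
import numpy as np
import pandas as pd


def make_toy():
    """Rebuild the toy frame from the question (hypothetical helper)."""
    values = np.arange(35, dtype=np.float32).reshape(7, 5)
    df = pd.concat(
        (
            pd.DataFrame(list("abcdefg"), columns=["field1"]),
            pd.DataFrame(values, columns=["field2", "2014", "2015", "2016", "2017"]),
        ),
        axis=1,
    )
    df.iloc[1:4, 4:] = np.nan
    df.iloc[4, 3:] = np.nan
    return df


data = make_toy()
years = data[["2014", "2015", "2016", "2017"]]

# For each row, the label of the last column holding a non-null value
# (None when a row has no observation at all).
date = years.apply(lambda row: row.last_valid_index(), axis=1)

# Look up the observation sitting at that (row, year) position.
obs = pd.Series(
    [years.at[i, col] if col is not None else np.nan for i, col in date.items()],
    index=date.index,
)

result = data[["field1", "field2"]].assign(obs=obs, date=date)
print(result)
```

The row-wise apply is easy to read but runs a Python function per row, so on a large frame a vectorized variant is likely to be faster.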
Answer 3 (score: 1)
Another possibility is to use sort_values and drop_duplicates:
data.melt(id_vars=["field1", "field2"], var_name="date",
          value_name="obs")\
    .dropna(subset=['obs'])\
    .sort_values(['field1', 'date'], ascending=[True, False])\
    .drop_duplicates('field1', keep='first')
gives you
   field1  field2  date   obs
21      a     0.0  2017   4.0
8       b     5.0  2015   7.0
9       c    10.0  2015  12.0
10      d    15.0  2015  17.0
4       e    20.0  2014  21.0
26      f    25.0  2017  29.0
27      g    30.0  2017  34.0
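One thing to watch: in the snippet above, date is still a string, so the descending sort is lexicographic. That happens to work for four-digit years. A hedged, self-contained variant that casts the year to int first and keeps the last row per key after an ascending sort might look like this (`make_toy`, `long` and `latest` are names I made up):

```python
import numpy as np
import pandas as pd


def make_toy():
    """Rebuild the toy frame from the question (hypothetical helper)."""
    values = np.arange(35, dtype=np.float32).reshape(7, 5)
    df = pd.concat(
        (
            pd.DataFrame(list("abcdefg"), columns=["field1"]),
            pd.DataFrame(values, columns=["field2", "2014", "2015", "2016", "2017"]),
        ),
        axis=1,
    )
    df.iloc[1:4, 4:] = np.nan
    df.iloc[4, 3:] = np.nan
    return df


data = make_toy()

long = data.melt(
    id_vars=["field1", "field2"], var_name="date", value_name="obs"
).dropna(subset=["obs"])
# Sort on real integers rather than on year strings.
long["date"] = long["date"].astype(int)

# After an ascending sort, the last row per key is the most recent observation.
latest = (
    long.sort_values(["field1", "date"])
    .drop_duplicates("field1", keep="last")
    .reset_index(drop=True)[["field1", "field2", "obs", "date"]]
)
print(latest)
```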