I have the following DataFrame:
I need to find the latest date and hour for each id. For example, for id = 1 I want 2019-10-21 and 4, but although I get the correct date, I get hour = 5.
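The question does not show the code that produced hour = 5, so the following is only a guess at the cause: a minimal sketch, assuming the maximum was taken for each column independently within each id. The three sample rows are taken from the data in the answer; the variable name `independent` is made up for illustration.

```python
import pandas as pd

# Three rows for id '1', copied from the sample data in the answer
df = pd.DataFrame({
    'id': ['1', '1', '1'],
    'date': pd.to_datetime(['2019-10-21', '2019-10-21', '2019-10-19']),
    'hour': [3, 4, 5],
})

# Assumed cause: max() is applied to each column separately,
# so the date and the hour can come from different rows
independent = df.groupby('id')[['date', 'hour']].max()
print(independent)
```

Here the latest date (2019-10-21) and the largest hour (5, which belongs to 2019-10-19) end up in the same result row, even though no such row exists in the data.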
Answer 0 (score: 1)
Use DataFrame.sort_values on all 3 columns, then remove duplicates in the id column with DataFrame.drop_duplicates:
import pandas as pd

L = [{'date': '2019-10-21', 'hour': 3, 'id': '1'},
     {'date': '2019-10-21', 'hour': 4, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 1, 'id': '1'},
     {'date': '2019-10-21', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-19', 'hour': 5, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '2'},
     {'date': '2019-10-20', 'hour': 0, 'id': '3'}]
df = pd.DataFrame(L)
# Parse the date strings so they sort chronologically
df['date'] = pd.to_datetime(df['date'])
# Sort by id ascending, then date and hour descending, so the newest row
# of each id comes first; drop_duplicates keeps the first row per id by default
df = (df.sort_values(['id', 'date', 'hour'], ascending=[True, False, False])
        .drop_duplicates('id'))
print(df)
        date  hour id
1 2019-10-21     4  1
7 2019-10-20     0  2
8 2019-10-20     0  3
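As an alternative to sorting, here is a minimal sketch that combines date and hour into a single timestamp and picks the row with the latest timestamp per id using groupby and idxmax. This is not part of the original answer; the names `ts` and `latest` are made up for illustration, and the data is the same sample list as above.

```python
import pandas as pd

L = [{'date': '2019-10-21', 'hour': 3, 'id': '1'},
     {'date': '2019-10-21', 'hour': 4, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 1, 'id': '1'},
     {'date': '2019-10-21', 'hour': 0, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '1'},
     {'date': '2019-10-19', 'hour': 5, 'id': '1'},
     {'date': '2019-10-20', 'hour': 0, 'id': '2'},
     {'date': '2019-10-20', 'hour': 0, 'id': '3'}]
df = pd.DataFrame(L)
df['date'] = pd.to_datetime(df['date'])

# Hypothetical helper Series: date plus hour as one comparable timestamp
ts = df['date'] + pd.to_timedelta(df['hour'], unit='h')

# idxmax gives, for each id, the index label of the row with the latest timestamp
latest = df.loc[ts.groupby(df['id']).idxmax()]
print(latest)
```

Because date and hour are compared as one value, the hour always comes from the same row as the latest date. If several rows of an id share the latest timestamp, idxmax returns the first of them.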