根据日期和字符串长度在DataFrame中修改数据

时间:2018-12-18 08:21:16

标签: python python-3.x pandas dataframe

我需要清理Pandas DataFrame中的一些数据并为此苦苦挣扎。

样本数据:

Date       | ID     | Name             | Address
-----------------------------------------------------------------------------------------------
1-4-1987   | 124578 | T.Hilpert        | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo

我想做的就是这个。按ID分组,从最近的日期获取名称,并获取最长的地址字符串。将这些用于所有出现的ID(在两个新列中:Name_newAddress_New)。请在下面找到所需的样本:

Date       | ID     | Name             | Address                                                | Name_New         | Address_New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
27-12-2017 | 124578 | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo                            | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
1-4-1987   | 124578 | T. Hilpert       | 518 Hessel Plaza Lake Lonzo, AZ 11863                  | Theresia Hilpert | 518 Hessel Plaza Lake Lonzo, AZ 11863
23-6-1990  | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury                | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
9-9-2010   | 947383 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205 | Birdie Reynolds  | 964 Weissnat Green Suite 568 Rennerbury, WV 16241-5205
12-5-1960  | 746732 | Earline Schulist | 57367 Alfredo Vista East Bertaburgh                    | Earline Schulist | 57367 Alfredo Vista East Bertaburgh

我已经尝试过了,但是无法将其组合起来以获得期望的结果。

def f1(s):
    return max(s, key=len)

df_new = df['New_Address'] = df.groupby('ID').agg({'Address': f1})


df_new = df[df.groupby('ID').Date.transform('max') == df['Date']]

特别感谢您的帮助。

1 个答案:

答案 0 :(得分:1)

使用transform来返回Series,其大小与原始DataFrame相同,然后按Name列创建索引,并按{{ 3}}:

Date