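Concatenate related fields and replace them in a DataFrame

Time: 2019-04-20 01:01:53

Tags: python regex pandas

I'm concatenating pairs of related fields across a large dataset. I think I have most of what I need, but I can't get the fields to join correctly.

DataFrame: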

id| date1foo| time1bar| date2foo| time2bar| date3foo | time3bar
--|---------|---------|---------|---------|----------|--------
2 |1/4/2017 |01:03:45 |1/4/2017 |01:03:45 |1/4/2019  |12:44:45
3 |2/4/2017 |03:12:32 |2/4/2017 |03:16:23 |3/4/2019  |22:32:55
4 |2/5/2017 |04:11:54 |7/5/2017 |06:23:31 |2/19/2019 |19:03:11
5 |2/6/2017 |02:15:34 |9/15/2017|01:12:32 |3/15/2019 |11:11:11
6 |3/17/2017|04:44:12 |10/3/2017|07:19:52 |4/4/2019  |07:03:14

I need to replace these fields with new, combined fields, like so:

id| datetime1        | datetime2        | datetime3
--|------------------|------------------|------------------
2 |1/4/2017 01:03:45 |1/4/2017 01:03:45 |1/4/2019 12:44:45
3 |2/4/2017 03:12:32 |2/4/2017 03:16:23 |3/4/2019 22:32:55
4 |2/5/2017 04:11:54 |7/5/2017 06:23:31 |2/19/2019 19:03:11
5 |2/6/2017 02:15:34 |9/15/2017 01:12:32|3/15/2019 11:11:11
6 |3/17/2017 04:44:12|10/3/2017 07:19:52|4/4/2019 07:03:14

I feel like I'm getting close with the following.

Code:

pattern_date = re.compile("date[0-9]{1,2}foo")
pattern_time = re.compile("time[0-9]{1,2}bar")

cols_date = [pattern_date.match(x).group() for x in df.columns if
        pattern_date.match(x) is not None]

cols_time = [pattern_time.match(x).group() for x in df.columns if
        pattern_time.match(x) is not None]


df[cols_time] = df[cols_date].applymap(lambda x: str(x) + [i for i in df[cols_date]])

# renaming fields code would go here

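What am I missing here? Is there a better way to do this? Any help would be greatly appreciated.

Thanks!

1 Answer:

Answer 0: (score: 1)

We can use DataFrame.filter to select the date columns and the time columns, then zip them together so that each date column is paired with its matching time column to build the datetime: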

df_new = pd.DataFrame({'id':df.id.values})

for index, cols in enumerate(zip(df.filter(regex='^date').columns, df.filter(regex='^time').columns)):
    df_new[f'datetime{index+1}'] = df[cols[0]] + ' ' + df[cols[1]]

print(df_new)
   id           datetime1           datetime2           datetime3
0   2   1/4/2017 01:03:45   1/4/2017 01:03:45   1/4/2019 12:44:45
1   3   2/4/2017 03:12:32   2/4/2017 03:16:23   3/4/2019 22:32:55
2   4   2/5/2017 04:11:54   7/5/2017 06:23:31  2/19/2019 19:03:11
3   5   2/6/2017 02:15:34  9/15/2017 01:12:32  3/15/2019 11:11:11
4   6  3/17/2017 04:44:12  10/3/2017 07:19:52   4/4/2019 07:03:14
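Since the question asks for the original fields to be replaced, here is a minimal, self-contained sketch of that variant. It rebuilds the sample data from the question, adds the combined columns, and then drops the source columns. The names `out` and `parsed` are illustrative, and the final conversion step assumes the dates are month/day/year, which the sample values (for example 2/19/2019) suggest but the question does not state.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [2, 3, 4, 5, 6],
    "date1foo": ["1/4/2017", "2/4/2017", "2/5/2017", "2/6/2017", "3/17/2017"],
    "time1bar": ["01:03:45", "03:12:32", "04:11:54", "02:15:34", "04:44:12"],
    "date2foo": ["1/4/2017", "2/4/2017", "7/5/2017", "9/15/2017", "10/3/2017"],
    "time2bar": ["01:03:45", "03:16:23", "06:23:31", "01:12:32", "07:19:52"],
    "date3foo": ["1/4/2019", "3/4/2019", "2/19/2019", "3/15/2019", "4/4/2019"],
    "time3bar": ["12:44:45", "22:32:55", "19:03:11", "11:11:11", "07:03:14"],
})

date_cols = list(df.filter(regex=r"^date").columns)
time_cols = list(df.filter(regex=r"^time").columns)

# Add one combined column per (date, time) pair, then drop the originals.
out = df.copy()
for n, (d, t) in enumerate(zip(date_cols, time_cols), start=1):
    out[f"datetime{n}"] = out[d] + " " + out[t]
out = out.drop(columns=date_cols + time_cols)

# Optional: parse the strings into real datetimes.
# Assumption: the values are month/day/year.
parsed = pd.to_datetime(out["datetime1"], format="%m/%d/%Y %H:%M:%S")

print(out)
```

Keeping the combined values as strings matches the output requested in the question; parsing them is only worthwhile if later steps need date arithmetic or sorting.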

What exactly does DataFrame.filter do? It returns the columns whose names match the regular expression:

print(df.filter(regex='^date'))
    date1foo   date2foo   date3foo
0   1/4/2017   1/4/2017   1/4/2019
1   2/4/2017   2/4/2017   3/4/2019
2   2/5/2017   7/5/2017  2/19/2019
3   2/6/2017  9/15/2017  3/15/2019
4  3/17/2017  10/3/2017   4/4/2019

print(df.filter(regex='^time'))

   time1bar  time2bar  time3bar
0  01:03:45  01:03:45  12:44:45
1  03:12:32  03:16:23  22:32:55
2  04:11:54  06:23:31  19:03:11
3  02:15:34  01:12:32  11:11:11
4  04:44:12  07:19:52  07:03:14
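One caveat: zipping the two filtered results pairs columns purely by position, so it relies on the date and time columns appearing in the same order. If that order is not guaranteed, a possible alternative is to pair columns by the number embedded in their names. The sketch below assumes the `dateNfoo` / `timeNbar` naming from the question; `index_columns` is a hypothetical helper, not a pandas function.

```python
import re
import pandas as pd

# A deliberately shuffled subset of the question's data.
df = pd.DataFrame({
    "id": [3, 4],
    "time2bar": ["03:16:23", "06:23:31"],
    "date1foo": ["2/4/2017", "2/5/2017"],
    "date2foo": ["2/4/2017", "7/5/2017"],
    "time1bar": ["03:12:32", "04:11:54"],
})


def index_columns(columns, pattern):
    """Map the number captured by `pattern` to the matching column name."""
    found = {}
    for col in columns:
        m = re.match(pattern, col)
        if m:
            found[int(m.group(1))] = col
    return found


dates = index_columns(df.columns, r"date(\d{1,2})foo$")
times = index_columns(df.columns, r"time(\d{1,2})bar$")

# Only combine numbers that have both a date and a time column.
out = df[["id"]].copy()
for num in sorted(dates.keys() & times.keys()):
    out[f"datetime{num}"] = df[dates[num]] + " " + df[times[num]]

print(out)
```

With this column order, a purely positional zip would join date1foo with time2bar; matching on the captured number avoids that.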

Note: the f-strings I used are only supported in Python 3.6 and later. If you are on an older version, use str.format instead:

df_new['datetime{}'.format(index+1)]