How to group Pandas data by columns using a regular expression match

Time: 2017-03-27 01:39:45

Tags: python regex pandas

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e'],
                   'XX_111_S5_R12_001_Mobile_05':[-14,-90,-90,-96,-91],
                   'YY_222_S00_R12_001_1-999_13':[-103,0,-110,-114,-114],
                   'ZZ_111_S00_R12_001_1-999_13':[1,2.3,3,5,6],
})

df.set_index('id',inplace=True)
df

It looks like this:

Out[6]:
    XX_111_S5_R12_001_Mobile_05  YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id
a                           -14                         -103                          1.0
b                           -90                            0                          2.3
c                           -90                         -110                          3.0
d                           -96                         -114                          5.0
e                           -91                         -114                          6.0

What I want to do is group the columns according to the capture group in the following regular expression:

\w+_\w+_\w+_\d+_([\w\d-]+)_\d+

In the end, the columns should be grouped by Mobile and 1-999.

What is the way to do this? I tried the following, but it failed to group them:

import re
grouped = df.groupby(lambda x: re.search(r"\w+_\w+_\w+_\d+_([\w\d-]+)_\d+", x).group(), axis=1)
for name, group in grouped:
    print(name)
    print(group)

Which prints these names:

XX_111_S5_R12_001_Mobile_05
YY_222_S00_R12_001_1-999_13
ZZ_111_S00_R12_001_1-999_13

What we want is for name to print:

Mobile
1-999
1-999

and for group to print the corresponding data frame.

3 Answers:

Answer 0: (score: 6)

You can use .str.extract on the columns and pass the result to groupby:

# Performing the groupby.
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
grouped = df.groupby(df.columns.str.extract(pat, expand=False), axis=1)

# Showing group information.
for name, group in grouped:
    print(name)
    print(group, '\n')

1-999
    YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id                                                          
a                          -103                          1.0
b                             0                          2.3
c                          -110                          3.0
d                          -114                          5.0
e                          -114                          6.0 

Mobile
    XX_111_S5_R12_001_Mobile_05
id                             
a                           -14
b                           -90
c                           -90
d                           -96
e                           -91 

This returns the expected groups, as the output above shows.
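As for why the attempt in the question did not group anything: calling .group() with no argument on a match object returns the whole match, which here is the full column name, so every column became its own group. The captured part is available as .group(1). A minimal sketch using only the re module and the column names from the question (the variable names whole and captured are just for illustration):

```python
import re

# Pattern and column names taken from the question.
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
cols = ['XX_111_S5_R12_001_Mobile_05',
        'YY_222_S00_R12_001_1-999_13',
        'ZZ_111_S00_R12_001_1-999_13']

# .group() is the same as .group(0): the entire matched text.
whole = [re.search(pat, c).group() for c in cols]

# .group(1) is the text captured by the first pair of parentheses.
captured = [re.search(pat, c).group(1) for c in cols]

print(whole)     # the full column names, one distinct key per column
print(captured)  # ['Mobile', '1-999', '1-999']
```

With .group(1) in the lambda, the original groupby call would receive the keys it was meant to receive.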

Answer 1: (score: 1)

After grouping, set the index of the new data frame to [re.findall(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+', col)[0] for col in df.columns] (which evaluates to ['Mobile', '1-999', '1-999']).
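The answer does not show complete code, so the following is only one possible reading of it: compute the key list with the list comprehension, then use it to collect the columns of each group. The names keys, columns_by_key and frames are hypothetical helpers, and plain dictionaries stand in for groupby here.

```python
import re
import pandas as pd

# Data frame from the question.
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
                   'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
                   'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6]})
df = df.set_index('id')

pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'

# One key per column, in column order.
keys = [re.findall(pat, col)[0] for col in df.columns]

# Collect the column names that share a key.
columns_by_key = {}
for key, col in zip(keys, df.columns):
    columns_by_key.setdefault(key, []).append(col)

# One sub-frame per group name.
frames = {key: df[cols] for key, cols in columns_by_key.items()}

for name, group in frames.items():
    print(name)
    print(group, '\n')
```

This avoids the axis argument of groupby entirely, at the cost of not returning a GroupBy object.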

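Answer 2: (score: 1)

Your regex has some problems: \w matches word characters, which include the underscore, and that does not look like what you want. If you only want to match letters, digits, and hyphens, using A-Za-z0-9- would be better:

df.groupby(df.columns.str.extract(r"([A-Za-z0-9-]+)_\d+$", expand=False), axis=1).sum()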
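Recent pandas releases deprecate the axis=1 argument of groupby, so on a newer version the same idea may need to go through a transpose. The following is a sketch under that assumption, not part of the original answer; keys and summed are illustrative names.

```python
import pandas as pd

# Data frame from the question.
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
                   'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
                   'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6]})
df = df.set_index('id')

# expand=False makes str.extract return an Index of captured strings
# (one per column) instead of a DataFrame.
keys = df.columns.str.extract(r'([A-Za-z0-9-]+)_\d+$', expand=False)

# Transpose so the columns become rows, group the rows by key,
# sum within each group, then transpose back.
summed = df.T.groupby(keys).sum().T

print(summed)
```

Transposing a frame with mixed integer and float columns converts everything to float, so the sums come back as floats even for the Mobile group.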
