Remove/exclude columns from a DataFrame using regex - Python

Time: 2019-06-28 11:57:46

Tags: python regex python-3.x pandas dataframe

I have a DataFrame that can be generated with the code below:

    df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],'date3derived':[0,0,0],'val3':[7,9,11]})

The resulting DataFrame has the columns person_id, date1, date1derived, val1, date2, date2derived, val2, date3, date3derived and val3.

I want to drop the columns whose names contain "derived". I tried the regular expressions below, but could not get the expected output.

    df = df.filter(regex='[^H\dDerived]+', axis=1)
    df = df.filter(regex='[^Derived]',axis=1)

Could you tell me the correct regex?

5 answers:

Answer 0 (score: 1)

df[[c for c in df.columns if 'derived' not in c ]]

Output

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

Answer 1 (score: 1)

You can use a zero-width negative lookahead to make sure the string derived does not appear anywhere in the column name:

^(?!.*?derived)
  • ^ matches the start of the string
  • (?!.*?derived) is a negative lookahead that ensures derived does not appear anywhere in the string

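Your pattern [^Derived] is a negated character class: it matches any single character that is not one of D / e / r / i / v / d. Every column name contains at least one such character, so every column matches and none is dropped.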
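As a minimal sketch of how the lookahead could be passed to df.filter, using a shortened version of the question's DataFrame (the variable names kept and unchanged are only illustrative). It also shows why the [^Derived] attempt leaves all columns in place:

```python
import pandas as pd

# Shortened version of the DataFrame from the question
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
    'date1derived': [0, 0, 0],
    'val1': [2, 4, 6],
})

# DataFrame.filter keeps a label when re.search finds the pattern in it.
# The negative lookahead only succeeds for labels without "derived".
kept = df.filter(regex=r'^(?!.*derived)', axis=1)
print(list(kept.columns))  # ['person_id', 'date1', 'val1']

# [^Derived] matches one character that is none of D, e, r, i, v, d.
# Every label here contains such a character, so nothing is dropped.
unchanged = df.filter(regex=r'[^Derived]', axis=1)
print(list(unchanged.columns))  # ['person_id', 'date1', 'date1derived', 'val1']
```

Note that the match is case-sensitive, so a label such as date1Derived would not be excluded by this pattern.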

Answer 2 (score: 1)

IIUC, you want to drop the columns whose names contain derived. This should do it:

df.drop(df.filter(like='derived').columns, axis=1)

Out[455]:
   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

Answer 3 (score: 1)

Using pd.Index.difference() together with df.filter():

df[df.columns.difference(df.filter(like='derived').columns,sort=False)]

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

Answer 4 (score: 1)

In recent versions of pandas, you can use string methods on the index and columns. Here, str.endswith seems appropriate.

import pandas as pd

df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],
                   'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],
                   'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],
                   'date3derived':[0,0,0],'val3':[7,9,11]})

df = df.loc[:,~df.columns.str.endswith('derived')]

print(df)

Output:

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11
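str.endswith only removes columns whose names end with derived. If the substring could appear anywhere in the name, as the question's wording suggests, a similar sketch with str.contains might be used instead. The column derived_flag is a hypothetical addition for illustration and is not part of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1derived': [0, 0, 0],
    'derived_flag': [1, 0, 1],  # hypothetical column, not in the question
    'val1': [2, 4, 6],
})

# Boolean mask: True for every label containing the literal text "derived"
mask = df.columns.str.contains('derived', regex=False)

# Keep only the columns where the mask is False
result = df.loc[:, ~mask]
print(list(result.columns))  # ['person_id', 'val1']
```

Passing regex=False treats the argument as plain text, which avoids surprises if the substring ever contains regex metacharacters.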