Question

This is my regular expression: date_regex='\d{1,2}\/\d{1,2}\/\d{4}$'

This is my dates_as_first_row dataframe:

I am trying to filter out the date columns, but I get an empty (377, 0) dataframe: date_column=dates_as_first_row.filter(regex=attempt,axis='columns')

Answer 1

You can use .str.match for this.
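Some background on why the original call returns zero columns: DataFrame.filter compares the pattern with the axis labels, not with the cell values. The sketch below reproduces that on a small made-up frame and then, assuming the goal is to keep the columns whose first-row value looks like a date (a guess based on the name dates_as_first_row), builds the column mask with .str.match instead. The sample data and the column labels c0, c1, c2 are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for dates_as_first_row: the dates sit in the first row,
# while the column labels are ordinary names.
dates_as_first_row = pd.DataFrame(
    [["name", "1/2/2020", "13/12/2021"],
     ["alice", "3", "4"]],
    columns=["c0", "c1", "c2"],
)

date_regex = r"\d{1,2}/\d{1,2}/\d{4}$"

# filter() tests the pattern against the column labels (c0, c1, c2),
# never against the values, so no column is kept here.
by_label = dates_as_first_row.filter(regex=date_regex, axis="columns")

# Build a boolean mask from the first row and use it to select columns.
first_row = dates_as_first_row.iloc[0].astype(str)
is_date = first_row.str.match(date_regex)
date_columns = dates_as_first_row.loc[:, is_date]
```

With this sample, by_label has no columns, while date_columns keeps c1 and c2.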

If your column is named '0', it would look like this:

indexer = df['0'].str.match(r'\d{1,2}/\d{1,2}/\d{4}$')
df[indexer]
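As a small self-contained illustration of the single-column case, here is a sketch on made-up data. The frame contents and the extra value column are assumptions for the example. Passing na=False is an optional addition: it makes missing values count as non-matches, so the mask contains only True and False.

```python
import pandas as pd

# Hypothetical sample data; the column label is the string '0' as above.
df = pd.DataFrame({
    "0": ["1/2/2020", "not a date", None, "12/31/1999 23:59"],
    "value": [10, 20, 30, 40],
})

# str.match anchors at the start of each string and the trailing $ anchors the end,
# so the whole cell has to look like a date.
# na=False maps missing values to False instead of leaving them as NaN.
indexer = df["0"].str.match(r"\d{1,2}/\d{1,2}/\d{4}$", na=False)
result = df[indexer]
```

Only the first row survives: the last value starts with a date but has trailing text, so the $ anchor rejects it.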

If you want to select all rows that contain the pattern in any of the string columns, you can do the following:

# select_dtypes('O') keeps only the object columns.
# apply(...) runs str.match on each of those columns, giving True where the value in that column matches the pattern.
# any(axis='columns') returns True for a row if at least one of its columns matched, otherwise False.
indexer = df.select_dtypes('O').apply(lambda ser: ser.str.match(r'\d{1,2}/\d{1,2}/\d{4}$'), axis='index').any(axis='columns')
df[indexer]
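To make the intermediate steps visible, here is a sketch that splits the one-liner into two parts on a made-up frame. The column names id, start and end are hypothetical. The string columns are created with dtype=object explicitly so that select_dtypes('O') picks them up regardless of how the installed pandas version infers string columns.

```python
import pandas as pd

# Hypothetical sample frame: one numeric column and two object columns of strings.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "start": pd.Series(["1/2/2020", "n/a", "unknown"], dtype=object),
    "end": pd.Series(["pending", "3/4/2021", "unknown"], dtype=object),
})
pattern = r"\d{1,2}/\d{1,2}/\d{4}$"

# Step 1: one boolean column per object column.
per_column = df.select_dtypes("O").apply(lambda ser: ser.str.match(pattern))

# Step 2: collapse to a single boolean per row (True if any column matched).
indexer = per_column.any(axis="columns")
matching_rows = df[indexer]
```

Here the rows with id 1 and 2 are kept, because each has a date in one of its string columns, and the numeric id column is never passed to str.match.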

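Note, however, that this only works if all of your object columns actually store strings. That is usually the case if you let pandas infer the column types itself when the dataframe is created. If it is not, you need to add a type check to avoid runtime errors: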

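import re

def filter_dates(ser):
    # compile the pattern once per column instead of once per value
    date_re = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')
    # values that are not strings are treated as non-matches
    return ser.map(lambda val: isinstance(val, str) and bool(date_re.match(val)))

indexer = df.select_dtypes('O').apply(filter_dates, axis='index').any(axis='columns')
df[indexer]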
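For a quick check of the type-checked variant, the sketch below runs it on a made-up frame whose object columns mix strings with other values. The column names mixed and text are hypothetical, and the pattern is compiled at module level here purely for brevity.

```python
import re
import pandas as pd

date_re = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")

def filter_dates(ser):
    # Anything that is not a str (numbers, None, ...) counts as a non-match.
    return ser.map(lambda val: isinstance(val, str) and bool(date_re.match(val)))

# Hypothetical object columns holding a mix of strings and non-strings.
df = pd.DataFrame({
    "mixed": pd.Series(["1/2/2020", 42, None, 3.5], dtype=object),
    "text": pd.Series(["a", "7/8/2022", "b", "c"], dtype=object),
})

indexer = df.select_dtypes("O").apply(filter_dates).any(axis="columns")
matching_rows = df[indexer]
```

The first two rows are selected. The integer, the missing value and the float never reach the regular expression.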

How do I filter dataframe columns using a regular expression?
