Question

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})

        a  b
0     abc  0
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

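I now want to select all columns of this dataframe for the rows where the entry in column a starts with r followed by five digits.

From here I learned how to do this when matching only the leading r, without the digits:

print(df.loc[df['a'].str.startswith('r'), :])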

        a  b
1  r00001  1
2  r00010  2
3    rfoo  3
4  r01234  4
5   r1234  5

Something like

print(df.loc[df['a'].str.startswith(r'[r]\d{5}'), :])

does not work, of course. How do I do this correctly?

Answer 1

Option 1
pd.Series.str.match

df.a.str.match(r'^r\d{5}$')

0    False
1     True
2     True
3    False
4     True
5    False
Name: a, dtype: bool

Use it as a filter

df[df.a.str.match(r'^r\d{5}$')]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4
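
As a side note, the same filter can be written without the explicit ^ and $ anchors. The sketch below assumes pandas 1.1 or newer, where Series.str.fullmatch is available; it is an alternative spelling, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'],
                   'b': range(6)})

# fullmatch only returns True when the entire string matches the pattern,
# so the ^ and $ anchors are not needed here.
mask = df['a'].str.fullmatch(r'r\d{5}')
result = df[mask]
print(result)
```

This should select the same rows as the str.match filter above.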

Option 2
Custom list comprehension using string methods

f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()
[f(s) for s in df.a.values.tolist()]

[False, True, True, False, True, False]

Use it as a filter

df[[f(s) for s in df.a.values.tolist()]]

        a  b
1  r00001  1
2  r00010  2
4  r01234  4
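
One caveat worth keeping in mind: str.isdigit and the regex class \d are not exactly equivalent on non-ASCII input. The sketch below uses a made-up string ending in a superscript two (an assumption for illustration, not data from the question) to show where the two options can disagree.

```python
import re

f = lambda s: s.startswith('r') and (len(s) == 6) and s[1:].isdigit()

# Hypothetical input: 'r1234' followed by U+00B2 (superscript two).
odd = 'r1234\u00b2'

# str.isdigit accepts superscript digits, so the string-method check passes.
string_method_result = f(odd)

# \d only matches decimal digits (Unicode category Nd), so the regex rejects it.
regex_result = re.match(r'^r\d{5}$', odd) is not None

print(string_method_result, regex_result)
```

For plain ASCII data such as the frame in the question, both options agree.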

Timing

df = pd.concat([df] * 10000, ignore_index=True)

%timeit df[[s.startswith('r') and (len(s) == 6) and s[1:].isdigit() for s in df.a.values.tolist()]]
%timeit df[df.a.str.match(r'^r\d{5}$')]
%timeit df[df.a.str.contains(r'^r\d{5}$')]

10 loops, best of 3: 22.8 ms per loop
10 loops, best of 3: 33.8 ms per loop
10 loops, best of 3: 34.8 ms per loop
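
The %timeit lines above only work inside IPython. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module; the function names are made up for this example, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'r00001', 'r00010', 'rfoo', 'r01234', 'r1234'],
                   'b': range(6)})
big = pd.concat([df] * 10000, ignore_index=True)


def by_comprehension(frame):
    # Option 2: plain string methods inside a list comprehension.
    return frame[[s.startswith('r') and len(s) == 6 and s[1:].isdigit()
                  for s in frame['a'].tolist()]]


def by_match(frame):
    # Option 1: anchored regex with str.match.
    return frame[frame['a'].str.match(r'^r\d{5}$')]


def by_contains(frame):
    # Same anchored regex with str.contains.
    return frame[frame['a'].str.contains(r'^r\d{5}$')]


for func in (by_comprehension, by_match, by_contains):
    # Best of 3 runs, 5 calls per run, reported per call.
    seconds = min(timeit.repeat(lambda: func(big), number=5, repeat=3)) / 5
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per call')
```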

Answer 2

You can use str.contains and pass a regex pattern:

In[112]:
df.loc[df['a'].str.contains(r'^r\d{5}')]

Out[112]: 
        a  b
1  r00001  1
2  r00010  2
4  r01234  4

Here the pattern reads as follows: ^r means the string starts with the character r, and \d{5} then looks for 5 digits.

startswith looks for a literal character pattern, not a regex pattern, which is why your attempt fails.
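
A small sketch of that behaviour, using a made-up series whose last entry literally begins with the characters of the pattern (an artificial value added only for illustration):

```python
import pandas as pd

s = pd.Series(['abc', 'r00001', 'rfoo', r'[r]\d{5}x'])

# The argument is compared as a literal prefix; it is never compiled as a regex.
literal_mask = s.str.startswith(r'[r]\d{5}')
print(literal_mask.tolist())
```

Only the artificial last entry is flagged, because it is the only string that starts with the literal characters `[r]\d{5}`.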

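As for the difference between str.contains and str.match: they are analogous, but str.contains uses re.search whereas str.match uses re.match, which is stricter because it only matches at the start of the string; see the docs.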
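
The difference shows up as soon as the pattern is not anchored. The sketch below uses a small made-up series and a deliberately unanchored pattern to compare the two methods:

```python
import pandas as pd

s = pd.Series(['r00001', 'xr00001', 'r000010'])
pattern = r'r\d{5}'  # deliberately without ^ or $

# re.search semantics: the pattern may match anywhere in the string.
contains_mask = s.str.contains(pattern)

# re.match semantics: the pattern must match at the start of the string,
# but trailing characters after the match are still allowed.
match_mask = s.str.match(pattern)

print(contains_mask.tolist())
print(match_mask.tolist())
```

Neither method rejects 'r000010' here, since nothing in the pattern requires the string to end after five digits.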

Edit

In response to your comment: add $ so that the pattern matches exactly that number of characters, see related:

In[117]:
df = pd.DataFrame({'a': ['abc', 'r000010', 'r00010', 'rfoo', 'r01234', 'r1234'], 'b': range(6)})
df

Out[117]: 
         a  b
0      abc  0
1  r000010  1
2   r00010  2
3     rfoo  3
4   r01234  4
5    r1234  5

In[118]:
df.loc[df['a'].str.match(r'r\d{5}$')]

Out[118]: 
        a  b
2  r00010  2
4  r01234  4

Select data using regular expressions
