Question

I have a non-indexed data frame of more than 50000 rows, read from a csv file, that looks like this:

John   Mullen  12/08/1993  Passw0rd
Lisa   Bush    06/12/1990  myPass12
Maria  Murphy  30/03/1989  qwErTyUi
Seth   Black   21/06/1991  LoveXmas

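I want to validate every cell of each row against a specific regex:

Validate the password with the PassRegex below
Validate the first and last names with the NameRegex below
etc...

Then move every row in which any cell fails validation to a new data frame.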

import re
PassRegex = re.compile(r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$")
NameRegex = re.compile(r"^[a-zA-Z0-9\s\-]{2,80}$")

For example, in this case the rows below do not validate against PassRegex, so I want to move them to a separate data frame:

Maria  Murphy  30/03/1989  qwErTyUi
Seth   Black   21/06/1991  LoveXmas

Is there a way to do this without iterating over the whole data frame row by row and cell by cell?

Any help is much appreciated.

Answer 1

You can pass the regex to str.contains:

In [36]:
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
df[(df['password'].str.contains(passRegex, regex=True)) & (df['first'].str.contains(nameRegex, regex=True)) & (df['last'].str.contains(nameRegex, regex=True))]

Out[36]:
  first    last         dob  password
0  John  Mullen  12/08/1993  Passw0rd
1  Lisa    Bush  06/12/1990  myPass12

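To keep only the rows of interest, this builds a boolean mask for each condition and combines the masks with & (element-wise AND) rather than the and keyword. Each condition is wrapped in parentheses: because & binds more tightly than comparison operators, the parentheses become necessary as soon as a condition contains a comparison such as ==.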
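The difference between & and and can be checked directly. The following is a minimal sketch that rebuilds the four sample rows by hand (the column names first, last, dob and password are taken from the output above; the variable names pass_ok, name_ok, both_ok and and_failed are illustrative only):

```python
import pandas as pd

# Sample frame rebuilt by hand, using the column names shown in Out[36].
df = pd.DataFrame({
    "first": ["John", "Lisa", "Maria", "Seth"],
    "last": ["Mullen", "Bush", "Murphy", "Black"],
    "dob": ["12/08/1993", "06/12/1990", "30/03/1989", "21/06/1991"],
    "password": ["Passw0rd", "myPass12", "qwErTyUi", "LoveXmas"],
})

passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"

pass_ok = df["password"].str.contains(passRegex, regex=True)
name_ok = df["first"].str.contains(nameRegex, regex=True)

# `&` combines the two masks element by element.
both_ok = pass_ok & name_ok

# `and` asks for the truth value of a whole Series, which pandas refuses
# with a ValueError because the answer would be ambiguous.
try:
    pass_ok and name_ok
    and_failed = False
except ValueError:
    and_failed = True
```
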

The output of each condition:

In [37]:
df['password'].str.contains(passRegex, regex=True)

Out[37]:
0     True
1     True
2    False
3    False
Name: password, dtype: bool

In [38]:
df['first'].str.contains(nameRegex, regex=True)

Out[38]:
0    True
1    True
2    True
3    True
Name: first, dtype: bool

In [39]:
df['last'].str.contains(nameRegex, regex=True)

Out[39]:
0    True
1    True
2    True
3    True
Name: last, dtype: bool
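With more columns to check, the per-condition masks can also be gathered into one boolean frame, which shows which cell failed in each row. This is only a sketch under assumptions: the patterns mapping and the names checks and row_ok are hypothetical and not part of the original answer, and the sample frame is rebuilt by hand.

```python
import pandas as pd

# Sample frame rebuilt by hand, using the column names shown in Out[36].
df = pd.DataFrame({
    "first": ["John", "Lisa", "Maria", "Seth"],
    "last": ["Mullen", "Bush", "Murphy", "Black"],
    "dob": ["12/08/1993", "06/12/1990", "30/03/1989", "21/06/1991"],
    "password": ["Passw0rd", "myPass12", "qwErTyUi", "LoveXmas"],
})

passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"

# Hypothetical mapping of column name to the regex that validates it.
patterns = {"first": nameRegex, "last": nameRegex, "password": passRegex}

# One boolean column per validated column; True means the cell matched.
checks = pd.DataFrame(
    {col: df[col].str.contains(pat, regex=True) for col, pat in patterns.items()}
)

# A row is valid only if every checked cell in it is valid.
row_ok = checks.all(axis=1)
```

Here row_ok should hold the same values as chaining the individual masks with &, while checks keeps the per-cell detail.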

Then, when we combine them:

In [40]:
(df['password'].str.contains(passRegex, regex=True)) & (df['first'].str.contains(nameRegex, regex=True)) & (df['last'].str.contains(nameRegex, regex=True))

Out[40]:
0     True
1     True
2    False
3    False
dtype: bool
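The question also asks for the failing rows to end up in a separate data frame. One way to sketch that, assuming the combined mask above, is to index once with the mask and once with its negation (~). The names mask, valid and invalid are illustrative, and na=False is an added assumption that treats missing cells as invalid:

```python
import pandas as pd

# Sample frame rebuilt by hand, using the column names shown in Out[36].
df = pd.DataFrame({
    "first": ["John", "Lisa", "Maria", "Seth"],
    "last": ["Mullen", "Bush", "Murphy", "Black"],
    "dob": ["12/08/1993", "06/12/1990", "30/03/1989", "21/06/1991"],
    "password": ["Passw0rd", "myPass12", "qwErTyUi", "LoveXmas"],
})

passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"

# Combined mask; na=False (an assumption) makes missing cells count as invalid.
mask = (
    df["password"].str.contains(passRegex, regex=True, na=False)
    & df["first"].str.contains(nameRegex, regex=True, na=False)
    & df["last"].str.contains(nameRegex, regex=True, na=False)
)

valid = df[mask]     # rows where every checked cell matched
invalid = df[~mask]  # rows where at least one checked cell failed
```

Both selections are vectorised, so no explicit Python loop over the rows is needed.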

pandas: validating data frame cells with regex
