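I am using Python Pandas to work with two DataFrames. The first contains records from a customer database (first name, last name, email, and so on). The second contains a list of domain names, such as gmail.com and hotmail.com.
I want to exclude records from the customer DataFrame when the email address contains a domain from the second list. In other words, I need to drop a customer whenever the domain of their email address appears in the domain blacklist.
Here are the sample DataFrames: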
>>> customer = pd.DataFrame({'Email': [
"bob@example.com",
"jim@example.com",
"joe@gmail.com"], 'First Name': [
"Bob",
"Jim",
"Joe"]})
>>> blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})
>>> customer
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
2    joe@gmail.com        Joe
>>> blacklist
        Domain
0    gmail.com
1  outlook.com
The output I want is:
>>> filtered_list = magic_happens_here(customer, blacklist)
>>> filtered_list
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
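Here is what I have tried so far:
df1[df1['email'].isin(~df2['email'])
...but this obviously does not work for my use case.
I also tried df.apply, but I could not get the syntax right, and I suspect performance would be poor on the real dataset. Example:
df1['Email'].apply(lambda x: x for i in ['gmail.com', 'outlook.com'] if i in x)
This looks like it should work, but I get TypeError: 'generator' object is not callable.
The remaining questions are: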
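Answer 0 (score: 3)
Try this (invalid_emails is the tuple of blacklisted domains, tuple(blacklist['Domain']), as built in Answer 1):
customer[~customer.Email.str.endswith(invalid_emails)]
or
customer[~customer.Email.str.replace(r'^[^@]*\@', '', regex=True).isin(blacklist.Domain)]
(regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.)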
In [399]: filtered_list
Out[399]:
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
Explanation:
In [395]: customer.Email.str.replace(r'^[^@]*\@', '')
Out[395]:
0 example.com
1 example.com
2 gmail.com
Name: Email, dtype: object
In [396]: customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)
Out[396]:
0 False
1 False
2 True
Name: Email, dtype: bool
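The regex step above strips everything up to the @ and then compares whole domains. A minimal sketch of the same idea without a regular expression follows. It assumes the domain is whatever comes after the last @ and that the comparison should be case-insensitive; neither assumption is stated in the question, and filter_by_domain is a hypothetical helper name.

```python
import pandas as pd

customer = pd.DataFrame({
    "Email": ["bob@example.com", "jim@example.com", "joe@gmail.com"],
    "First Name": ["Bob", "Jim", "Joe"],
})
blacklist = pd.DataFrame({"Domain": ["gmail.com", "outlook.com"]})


def filter_by_domain(customers, blacklist):
    """Hypothetical helper: drop rows whose e-mail domain is blacklisted."""
    # Keep the text after the last "@" and lower-case it (assumed normalisation).
    domains = customers["Email"].str.rsplit("@", n=1).str[-1].str.lower()
    blocked = set(blacklist["Domain"].str.lower())
    # isin compares whole domains, so "notgmail.com" does not match "gmail.com".
    return customers[~domains.isin(blocked)]


filtered_list = filter_by_domain(customer, blacklist)
print(filtered_list)
```

Because the comparison is on the whole domain, this behaves like the replace/isin variant rather than the suffix test.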
Timing on a 300K-row DataFrame:
In [401]: customer = pd.concat([customer] * 10**5)
In [402]: customer.shape
Out[402]: (300000, 2)
In [420]: %timeit customer[~customer.Email.str.endswith(invalid_emails)]
10 loops, best of 3: 136 ms per loop
In [421]: %timeit customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
10 loops, best of 3: 151 ms per loop
In [422]: %timeit customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
1 loop, best of 3: 642 ms per loop
Conclusion:
customer[~customer.Email.str.endswith(invalid_emails)]
is slightly faster than
customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
while
customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
is much slower.
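The timings above come from an IPython session. For anyone who wants to repeat the comparison in a plain script, here is a rough sketch using the standard-library timeit module. The row count and repeat settings are arbitrary assumptions chosen to keep the run short, and the absolute numbers will vary with the machine and the pandas version. Passing a tuple to str.endswith relies on a reasonably recent pandas.

```python
import timeit

import pandas as pd

customer = pd.DataFrame({
    "Email": ["bob@example.com", "jim@example.com", "joe@gmail.com"],
    "First Name": ["Bob", "Jim", "Joe"],
})
blacklist = pd.DataFrame({"Domain": ["gmail.com", "outlook.com"]})
invalid_emails = tuple(blacklist["Domain"])

# Assumed size: 3,000 rows. Raise the multiplier to approach the 300K rows above.
big = pd.concat([customer] * 10**3, ignore_index=True)

candidates = {
    "str.endswith": lambda: big[~big["Email"].str.endswith(invalid_emails)],
    "apply + endswith": lambda: big[
        big["Email"].apply(lambda s: not s.endswith(invalid_emails))
    ],
    "regex replace + isin": lambda: big[
        ~big["Email"]
        .str.replace(r"^[^@]*@", "", regex=True)
        .isin(blacklist["Domain"])
    ],
}

# Run each candidate once to check they agree, then time them.
results = {name: fn() for name, fn in candidates.items()}
loops = 5
timings = {
    name: min(timeit.repeat(fn, number=loops, repeat=3))
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds / loops * 1000:.2f} ms per loop")
```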
Answer 1 (score: 2)
Code -
import pandas as pd
customer = pd.DataFrame({'Email': [
"bob@example.com",
"jim@example.com",
"joe@gmail.com"], 'First Name': [
"Bob",
"Jim",
"Joe"]})
blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})
invalid_emails = tuple(blacklist['Domain'])
df = customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
print(df)
Output -
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
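One caveat with the suffix test: endswith("gmail.com") also matches an address such as ann@notgmail.com. A small sketch of one possible workaround, assuming the blacklist holds bare domains, is to prefix each domain with @ before building the tuple. The extra address below is made-up test data, not part of the question.

```python
import pandas as pd

customer = pd.DataFrame({
    "Email": ["bob@example.com", "ann@notgmail.com", "joe@gmail.com"],
    "First Name": ["Bob", "Ann", "Joe"],
})
blacklist = pd.DataFrame({"Domain": ["gmail.com", "outlook.com"]})

# Plain suffixes: "notgmail.com" ends with "gmail.com", so Ann is dropped too.
plain_suffixes = tuple(blacklist["Domain"])
loose = customer[customer["Email"].apply(lambda s: not s.endswith(plain_suffixes))]

# Anchored suffixes: "@gmail.com" only matches when the whole domain is gmail.com.
anchored_suffixes = tuple("@" + domain for domain in blacklist["Domain"])
strict = customer[customer["Email"].apply(lambda s: not s.endswith(anchored_suffixes))]

print(loose["Email"].tolist())   # ['bob@example.com']
print(strict["Email"].tolist())  # ['bob@example.com', 'ann@notgmail.com']
```

Whether subdomains such as mail.gmail.com should also be blocked is a separate decision; the anchored version lets them through.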