python: Data cleaning - detecting patterns in fraudulent email addresses

Date: 2017-06-16 14:24:21

Tags: python data-cleaning

I am cleaning a dataset and removing the fraudulent email addresses in it.

I have built several rules that catch duplicates and fraudulent domains. But there is one scenario where I cannot figure out how to write a Python rule to flag the addresses.

So far I have rules like these:

# delete punctuation
df['email'] = df['email'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

# flag duplicates
df['duplicate'] = df.email.duplicated(keep=False)

Here is the data I cannot figure out a rule for. Essentially, I am looking for a way to flag addresses that start the same way but end in sequential numbers.

abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com

7 Answers:

Answer 0 (score: 2)

You can use regular expressions to do this; for example:

import re

a = "attn12345@gmail.com"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"

pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")

if pattern.search(a):
    print("spam1")

if pattern.search(b):
    print("spam2")

if pattern.search(c):
    print("spam3")

if pattern.search(d):
    print("spam4")

If you run the code, you will see:

$ python spam.py 
spam1
spam2
spam3
spam4

The nice thing about this approach is that it is standardized (regex), and you can easily adjust the strength of the match by tuning the values inside the {} braces; that means you could keep those values in a global config file and set/adjust them there. You can also tweak the regex easily without rewriting any code.
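As a sketch of that idea, the same pattern can be applied to the question's df['email'] column directly, with the {} bounds pulled out as tunable constants (the column name comes from the question; the sample rows and the 'suspect' column are assumptions for illustration):

```python
import re

import numpy as np
import pandas as pd

# tunable match-strength settings; these could live in a config file
MIN_DIGITS, MAX_DIGITS = 3, 500
pattern = re.compile(r"[0-9]{%d,%d}\.?[0-9]{0,%d}?@" % (MIN_DIGITS, MAX_DIGITS, MAX_DIGITS))

df = pd.DataFrame({'email': ['abc7020.14@gmail.com',
                             'attn12345@gmail.com',
                             'alice@gmail.com']})
# str.contains accepts a compiled regex, so the same object drives the flag
df['suspect'] = np.where(df['email'].str.contains(pattern), 'Y', 'N')
print(df)
```

Because the pattern requires at least MIN_DIGITS consecutive digits before the '@', a plain address like alice@gmail.com is left unflagged.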

Answer 1 (score: 1)

First, take a look at the regexp question here.

Second, try filtering the email addresses like this:

# Let's say email = 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits m will be a Match object, or None otherwise.
if m is not None:
    print ('%s is good' % email)
else:
    print ('%s is BAD' % email) 
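The same trailing-digits check can be vectorized over the question's whole df['email'] column (the sample rows and the 'ends_in_digits' column name below are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'email': ['attn1234@gmail.com',
                             'abc7020@gmail.com',
                             'bob@gmail.com']})
# take the local part before the '@', then test whether it ends in digits
local = df['email'].str.split('@').str[0]
df['ends_in_digits'] = local.str.contains(r'\d+$')
print(df)
```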

Answer 2 (score: 1)

You could use edit distance (aka Levenshtein distance) and pick a difference threshold. In Python:

$pip install editdistance
$ipython2
>>> import editdistance
>>> threshold = 5 # This could be anything, really
>>> data = ["attn1@gmail.com...", ...]# set up data to be the set you gave
>>> fraudulent_emails = set([email for email in data for _ in data if editdistance.eval(email, _) < threshold])

If you want to be a bit smarter about it, you could walk the resulting list instead of turning it into a set, keeping track of how many other email addresses each one is close to - then use that as a 'weight' for deciding which are fake.

This catches not only the given case (fraudulent addresses sharing a common start and differing only in a numeric suffix), but also numeric or alphabetic padding at, for example, the beginning or middle of an address.
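A minimal sketch of that weighting idea, using difflib from the standard library as a stand-in for editdistance (the 0.85 similarity threshold and the sample addresses are assumptions):

```python
from difflib import SequenceMatcher

emails = ['abc7020@gmail.com', 'abc7020.1@gmail.com', 'abc7020.10@gmail.com',
          'attn1@gmail.com', 'unrelated@yahoo.com']

def close(a, b, threshold=0.85):
    # similarity ratio in [0, 1]; a stdlib stand-in for edit distance
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold

# weight = how many near-duplicate neighbours each address has;
# high-weight addresses are the likeliest fakes
weights = {e: sum(close(e, other) for other in emails) for e in emails}
print(weights)
```

The three abc7020 variants each pick up two close neighbours, while the unrelated addresses get a weight of zero.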

Answer 3 (score: 1)

ids = [s.split('@')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        # mj is mi plus exactly one extra trailing character...
        if len(mj) == len(mi) + 1 and mj.startswith(mi):
            try:
                int(mj[-1])  # ...and that character is a digit
                det[j, i] = True
                det[i, j] = True
            except ValueError:
                continue

spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
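A self-contained run of the same idea on a small sample (the email_list values are taken from the question; the check is simplified to str.isdigit):

```python
import numpy as np

email_list = ['abc7020@gmail.com', 'attn1@gmail.com',
              'attn12@gmail.com', 'attn123@gmail.com']
ids = [s.split('@')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi, mj = ids[i], ids[j]
        # flag pairs where mj is mi plus exactly one trailing digit
        if len(mj) == len(mi) + 1 and mj.startswith(mi) and mj[-1].isdigit():
            det[i, j] = det[j, i] = True

spam_indices = np.where(det.sum(axis=0) != 0)[0].tolist()
print([email_list[i] for i in spam_indices])
```

The attn chain is caught because each member is one digit longer than the previous one; abc7020 alone is not, since nothing here extends it by a single character.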

Answer 4 (score: 1)

Here is how I would solve this:

fuzzywuzzy

Create a set of unique emails, loop over them and compare each to the rest with fuzzywuzzy. For example:

import re
from fuzzywuzzy import fuzz

for email in emailset:
    for row in data:
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            pass  # flagging operation goes here

I took some liberties with how the data is represented, but I think the variable names are clear enough to show what I am getting at. This is a very rough piece of code, in that I have not thought through how to stop repeat flagging.

Anyway, the elif part compares the two email addresses without @gmail.com (or any other domain, e.g. @yahoo.com), and if the ratio is above 80 (play with this number), applies your flagging operation. For example:

fuzz.partial_ratio("abc7020.1", "abc7020")

100
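If installing fuzzywuzzy is not an option, difflib from the standard library gives a roughly comparable score (it mirrors fuzz.ratio rather than partial_ratio, so the numbers will differ; the helper name here is made up):

```python
from difflib import SequenceMatcher

def ratio100(a, b):
    # similarity on a 0-100 scale, in the spirit of fuzzywuzzy's fuzz.ratio
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

print(ratio100('abc7020.1', 'abc7020'))
```

For the trailing-digit variant above, the score still comfortably clears an 80 threshold.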

Answer 5 (score: 1)

My solution is not efficient, nor pretty. But see if it works for you, @jeangelj. It definitely works for the examples you gave. Good luck!

import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('@')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices=[]
for i in range(1,len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

grouped=[]
prev = 0
for i in split_indices:
    grouped.append(emails[prev:i]) # slice between consecutive split points
    prev = i
grouped.append(emails[prev:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']

In [31]: spam
Out[31]: 
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']  

# THE TRUTH YALL!

Answer 6 (score: 1)

Here is one way to approach it, which should be pretty efficient. We do it by grouping the email addresses by length, so that we only need to check each address against the addresses one level shorter, via slicing and a set membership check.

The code:

First, read in the data:

import pandas as pd
import numpy as np

string = '''
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
foo123@bar.com
foo1@bar.com
'''

x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]

We strip off the @foo.bar part, then filter to keep only the addresses whose local part ends in a digit, and add a 'length' column:

#split on @, expand means into two columns
emails =  x.x.str.split('@', expand = True)
#filter rows where the last character of the local part is a digit
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(), :]
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()

Now, all we have to do is take each length n and n-1, and check whether each address, with its last character removed, appears in the set of addresses of length n-1 (and we have to check the reverse too, so the shortest duplicate in a group is caught as well):

#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)

#for each length
for j in lengths:
    #we subset those of that length
    totest = emails['lengths'] == j
    #and those who might be the shorter version
    against = emails['lengths'] == j -1
    #we make a set of unique values, for a hashed lookup
    againstset = set([i for i in emails.loc[against,0]])
    #we cut off the last char of each in to test
    tests = emails.loc[totest,0].str[:-1]
    #we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
    #viceversa, otherwise we miss the smallest one in the group
    againstset = set([i for i in emails.loc[totest,0].str[:-1]])
    tests = emails.loc[against,0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)

The resulting mask can be converted to boolean and used to subset the original (de-duplicated) dataframe; the index should match up with the original index for this subset:

x.loc[~mask.astype(bool),:]
    x
0   abc7020@gmail.com
16  foo123@bar.com
17  foo1@bar.com

You can see that we did not remove your first value, because the '.' means it does not match - you could remove punctuation first.
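For instance, removing '.' from the local part only (leaving the domain untouched) turns abc7020.1 into a plain one-digit extension of abc7020, which the length-based check above would then catch; a minimal sketch:

```python
import pandas as pd

emails = pd.Series(['abc7020@gmail.com', 'abc7020.1@gmail.com'])
# strip '.' from the local part only, leaving the domain untouched
local = emails.str.split('@').str[0].str.replace('.', '', regex=False)
print(local.tolist())
```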