Data cleaning in R: removing test customer names

Time: 2017-12-29 14:21:18

Tags: r data-cleaning

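I am working with customer data that contains customers' first and last names. I want to clean out any names that are just random keystrokes. Test accounts got mixed into the dataset and have junk names. For example, in the data below I would like to remove customers 2, 5, 9, 10, 12 and so on. I would appreciate your help.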

 Customer Id    FirstName   LastName
1   MARY    MEYER
2   GFRTYUIO    UHBVYY
3   CHARLES BEAL
4   MARNI   MONTANEZ
5   GDTDTTD DTTHDTHTHTHD
6   TIFFANY BAYLESS
7   CATHRYN JONES
8   TINA    CUNNINGHAM
9   FGCYFCGCGFC FGCGFCHGHG
10  ADDHJSDLG   DHGAHG
11  WALTER  FINN
12  GFCTFCGCFGC CG GFCGFCGFCGF
13  ASDASDASD   AASDASDASD
14  TYKTYKYTKTY YTKTYKTYK
15  HFHFHF  HAVE
16  REBECCA CROSSWHITE
17  GHSGHG  HGASGH
18  JESSICA TREMBLEY
19  GFRTYUIO    UHBVYY
20  HUBHGBUHBUH YTVYVFYVYFFV
21  HEATHER WYRICK
22  JASON   SPLICHAL
23  RUSTY   OWENS
24  DUSTIN  WILLIAMS
25  GFCGFCFGCGFC    GRCGFXFGDGF
26  QWQWQW  QWQWWW
27  LIWNDVLIHWDV    LIAENVLIHEAV
28  DARLENE SHORTRIDGE
29  BETH    HDHDHDH
30  ROBERT  SHIELDS
31  GHERDHBXFH  DFHFDHDFH
32  ACE TESSSSSRT
33  ALLISON AWTREY
34  UYGUGVHGVGHVG   HGHGVUYYU
35  HCJHV   FHJSEFHSIEHF

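4 answers:

Answer 0 (score: 2)

The real problem seems to be that you need a solid definition of an implausible name, which has nothing to do with R. In any case, I suggest you go by first name and remove every row whose first name is not a plausible one. As a source of plausible names, i.e. a positive list, you could use for example the SSA Baby Name Database. This should work well for filtering English first names. If you have more location-specific needs, look online for other baby name databases and try them as the positive list.

Once you have the names in a vector called positiveNames, filter out all non-positive names like this: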

# keep only the rows whose first name appears in the positive list
data_new <- data_original[data_original$firstName %in% positiveNames, ]
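
As a minimal sketch of the positive-list idea, the snippet below uses a tiny hand-made positiveNames vector in place of a real names database. The customers data frame, the keep vector and the case/whitespace normalization are assumptions added for illustration, not part of the answer itself.

```r
# Toy customer data (a few rows taken from the question).
customers <- data.frame(
  CustomerId = c(1, 2, 3, 5),
  FirstName  = c("MARY", "GFRTYUIO", "CHARLES", "GDTDTTD"),
  LastName   = c("MEYER", "UHBVYY", "BEAL", "DTTHDTHTHTHD"),
  stringsAsFactors = FALSE
)

# Hypothetical positive list; in practice it would come from a names database.
positiveNames <- c("Mary", "Charles", "Tina", "Walter")

# Normalize case and surrounding whitespace on both sides before matching.
keep <- toupper(trimws(customers$FirstName)) %in% toupper(trimws(positiveNames))

clean_customers   <- customers[keep, ]   # first name found in the positive list
flagged_customers <- customers[!keep, ]  # candidates for removal or manual review
```

A rare but genuine first name that is missing from the list would be flagged as well, so it is probably worth reviewing flagged_customers before deleting anything.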

Answer 1 (score: 2)

My approach would be the following:

1) Merge FirstName and LastName into a single string, strname. Then count how often each letter occurs in each strname.

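2) At this point we see that a real name such as "MARNIMONTANEZ" consists of two 'M's, two 'A's, one 'R', one 'I', three 'N's, one 'O', one 'T', one 'E' and one 'Z'.
A fake name such as "GFCTFCGCFGCCGGFCGFCGFCGF" consists of eight 'G's, seven 'F's, eight 'C's and one 'T'.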

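3) The pattern that separates real names from fake ones becomes clear:

  • Real names are characterized by a greater variety of letters. We can measure this with a variable check_real, computed as number of unique letters / total string length.
  • Fake names are characterized by a few letters repeated many times. We can measure this with a variable check_fake, computed as the average frequency of each letter.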

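4) Finally, we only need to define a threshold for each of the two variables to identify anomalies. Whenever a threshold is triggered, the corresponding flag, flag_real or flag_fake, is raised.

  • If flag_real == 1 & flag_fake == 0, the name is real.
  • If flag_real == 0 & flag_fake == 1, the name is fake.
  • In the rare cases where both flags are raised (i.e. flag_real == 1 & flag_fake == 1), you have to inspect the record manually and refine the thresholds.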
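
The four steps above can be sketched in base R roughly as follows. The helper name letter_stats and the two threshold values are assumptions chosen for illustration only; the answer does not prescribe specific cutoffs.

```r
# Letter statistics for one combined name string.
letter_stats <- function(strname) {
  letters_in_name <- strsplit(strname, "")[[1]]
  counts <- table(letters_in_name)   # occurrences of each distinct letter
  c(check_real = length(counts) / length(letters_in_name),  # unique / total
    check_fake = mean(counts))                               # average letter frequency
}

first <- c("MARNI", "GFCTFCGCFGC", "ASDASDASD")
last  <- c("MONTANEZ", "CG GFCGFCGFCGF", "AASDASDASD")

# Step 1: merge first and last name, dropping any spaces.
strname <- gsub(" ", "", paste0(first, last), fixed = TRUE)

# Steps 2-3: compute both measures for every name (one row per name).
stats <- t(vapply(strname, letter_stats, numeric(2)))

# Step 4: hypothetical thresholds, to be tuned on real data.
real_threshold <- 0.4
fake_threshold <- 3
result <- data.frame(
  strname,
  stats,
  flag_real = as.integer(stats[, "check_real"] >= real_threshold),
  flag_fake = as.integer(stats[, "check_fake"] >= fake_threshold),
  row.names = NULL
)
print(result)
```

Computed this way, check_fake is the reciprocal of check_real (total length divided by the number of unique letters), so how often the two flags conflict depends entirely on how the two thresholds are chosen.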

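Answer 2 (score: 1)

You can compute a variability strength for the full name (FirstName and LastName combined) as the number of unique letters in the full name divided by the total number of characters in the full name. Then simply remove the names with low variability strength. In other words, you remove names in which the same few random keystrokes recur again and again, which is what produces a low variability strength.

I did this with the charToRaw function, because it is faster, together with the dplyr library, as follows: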

# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7), 
          FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
          LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)


#test data: df
#   CustomerId    FirstName         LastName
#1         1           MARY            MEYER
#2         2    FGCYFCGCGFC       FGCGFCHGHG
#3         3    GFCTFCGCFGC      GFCGFCGFCGF
#4         4      ASDASDASD       AASDASDASD
#5         5        GDTDTTD     DTTHDTHTHTHD
#6         6         WALTER             FINN
#7         7    GFCTFCGCFGC   CG GFCGFCGFCGF

library(dplyr)
df %>%
  ## Combining FirstName and LastName
  mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
  group_by(FullName) %>%
  ## Calculating variability strength for each full name
  mutate(Variability = length(unique(as.integer(charToRaw(FullName))))/nchar(FullName))%>%
  ## Filtering full name, I set above or equal to 0.4 (You can change this)
  ## Meaning we are keeping full name that has variability strength greater than or equal to 0.40
  filter(Variability >= 0.40)


# A tibble: 2 x 5
# Groups:   FullName [2]
# CustomerId FirstName LastName    FullName   Variability
#  <dbl>     <chr>      <chr>        <chr>        <dbl>
#1   1        MARY      MEYER     MARY MEYER    0.6000000
#2   6      WALTER      FINN     WALTER FINN    0.9090909
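
For readers who prefer base R, the same ratio can be sketched without dplyr. The helper name variability and the choice to drop spaces before counting are assumptions made here, so the numbers differ slightly from the output above, where the separating space counts as a character.

```r
# Ratio of distinct characters to total characters, ignoring spaces.
variability <- function(full_name) {
  chars <- strsplit(gsub(" ", "", full_name, fixed = TRUE), "")[[1]]
  if (length(chars) == 0) return(0)   # treat an empty name as zero variability
  length(unique(chars)) / length(chars)
}

full_names <- c("MARY MEYER", "WALTER FINN", "ASDASDASD AASDASDASD")
scores <- vapply(full_names, variability, numeric(1))
print(round(scores, 2))

# Names at or above the same 0.40 cutoff used above.
kept <- names(scores)[scores >= 0.40]
print(kept)
```

Whether the space should count is a judgment call; excluding it makes the ratio depend only on the letters that were actually typed.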

Answer 3 (score: 0)

I tried to combine the suggestions above in the code below. Thanks everyone for the help.

# load required libraries 
library(hunspell)
library(dplyr)
# read data in dataframe df
df<-data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7,8), 
               FirstName = c("MARY"," ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
               LastName = c("MEYER","TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspell
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names (trim first so a leading space does not yield an empty name)
df$FirstNameOnly<-gsub(" .*","",trimws(df$FirstName))

# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
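# full.names = TRUE so read.csv gets the path, not just the bare file name
files<-list.files("/names",pattern="\\.txt$",full.names=TRUE)
# the SSA files have no header row, so header = FALSE keeps the first record
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x, header = FALSE,
                          col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))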
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flag for SSA names
df$flag_SSA<-df$FirstNameOnly %in% ssa_names$Name
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)

# Unique string length
df$ulen<-vapply(strsplit(df$strname, ""), function(ch) length(unique(ch)), integer(1))

# Ratio variable for unique string length over total string length 
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]
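
The last line collects the suspected test accounts in test. A possible follow-up, sketched here on a small hand-made data frame so that it runs without hunspell or the SSA files, is to keep the complement as the cleaned data. The values in demo (including both flag columns) and the names is_junk and cleaned are illustrative assumptions, not results from the code above.

```r
# Hand-made stand-in for df after the steps above; flag values are made up.
demo <- data.frame(
  strname  = c("MARYMEYER", "ASDASDASDAASDASDASD", "WALTERFINN"),
  ratio    = c(5 / 9, 3 / 19, 9 / 10),
  flag     = c(TRUE, FALSE, TRUE),   # stand-in for the spell-check flag
  flag_SSA = c(TRUE, FALSE, TRUE),   # stand-in for the SSA first-name flag
  stringsAsFactors = FALSE
)

# Same condition as the last line above: low ratio and no flag raised.
is_junk <- demo$ratio < 0.4 & !demo$flag_SSA & !demo$flag

cleaned <- demo[!is_junk, ]   # rows to keep
junk    <- demo[is_junk, ]    # rows to review before deleting
```

Requiring all three signals to agree before dropping a row is the conservative choice; relaxing any one of them removes more junk at the cost of more false positives.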