Question

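I need to find near-duplicate rows in a dataset. For my data, I have to add a "POSSIBLE_DUPLICATES" column that contains the indexes of the possible duplicates of each row. Besides FNAME and LNAME, the data contains several other fields that could also be used to find duplicates.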

| id | FNAME  | LNAME   | POSSIBLE_DUPLICATES |
|----|--------|---------|---------------------|
| 1  | Aaron  | Golding | 2,3                 |
| 2  | Aroon  | Golding | 1,3                 |
| 3  | Aaron  | Golding | 2,1                 |
| 4  | John   | Bold    | 6                   |
| 5  | Markus | M.      |                     |
| 6  | John   | Bald    | 4                   |

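I tried to find matches with the agrep() function, but I don't really understand how to call it for multiple columns or how to combine the matches for all rows. Any help would be appreciated.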

Answer 1

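Here is an example that uses agrep on an added field ("match"), which is the concatenation of the fields you want to use to identify duplicates (add other fields as needed). In this example, the list index corresponds to the row of the data.frame.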

# make a mock data.frame
df <- read.csv(textConnection("
id,FNAME,LNAME
1,Aaron,Golding
2,Aroon,Golding
3,Aaron,Golding
4,John,Bold
5,Markus,M.
6,John,Bald
"))

# string together the fields that might be matching and add to data.frame
df$match <- paste0(trimws(as.character(df$FNAME)), 
  trimws(as.character(df$LNAME)))

# make an empty list to fill in
possibleDups <- list()

# loop through each row and find matching strings
for (i in seq_len(nrow(df))) {
  # agrep returns the indices of all approximate matches, including row i itself
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    possibleDups[[i]] <- dups[dups != i]
  } else {
    possibleDups[[i]] <- NA
  }
}

# proof - print the list of possible duplicates
print(possibleDups) 

> [[1]]
> [1] 2 3

> [[2]]
> [1] 1 3

> [[3]]
> [1] 1 2

> [[4]]
> [1] 6

> [[5]]
> [1] NA

> [[6]]
> [1] 4

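If you only want a comma-separated string of the possible duplicates for each row, use this loop in place of the first one (the loop that fills the possibleDups list) and remove the line that creates the empty list.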

for (i in seq_len(nrow(df))) {
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    # drop the row's own index and join the rest into one string
    df$possibleDups[i] <- paste(dups[dups != i], collapse = ",")
  } else {
    df$possibleDups[i] <- NA
  }
}

print(df)

>   id  FNAME   LNAME        match possibleDups
> 1  1  Aaron Golding AaronGolding          2,3
> 2  2  Aroon Golding AroonGolding          1,3
> 3  3  Aaron Golding AaronGolding          1,2
> 4  4   John    Bold     JohnBold            6
> 5  5 Markus      M.     MarkusM.         <NA>
> 6  6   John    Bald     JohnBald            4
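
Since the question mentions that other fields can also help identify duplicates, here is a minimal sketch of building the "match" key from any set of columns. The helper name make_key and the CITY column are made up for illustration, and max.distance = 2 is only an example threshold that would need tuning on real data.

```r
# Hypothetical helper: concatenate the chosen columns into one comparison key
make_key <- function(data, cols) {
  parts <- lapply(data[cols], function(x) trimws(as.character(x)))
  do.call(paste0, unname(parts))
}

df2 <- data.frame(
  FNAME = c("Aaron", "Aroon", "John", "John"),
  LNAME = c("Golding", "Golding", "Bold", "Bald"),
  CITY  = c("Boston", "Boston", "Austin", "Austin"),  # made-up extra field
  stringsAsFactors = FALSE
)
df2$match <- make_key(df2, c("FNAME", "LNAME", "CITY"))

# an integer max.distance caps the total number of edits allowed for a match
hits <- agrep(df2$match[1], df2$match, max.distance = 2)
print(hits)
```

Longer keys give agrep more text to compare, so a threshold that works for two name fields may be too strict or too loose once more fields are added.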

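As a loop-free variation on the same idea, the sketch below uses adist() from the utils package to compute all pairwise edit distances at once and then keeps the pairs under a threshold. This swaps agrep's approximate substring matching for whole-string Levenshtein distance, so results can differ on other data. The helper name find_near_dups and the threshold of 2 are assumptions for illustration.

```r
# Hypothetical helper: for each key, the indices of other keys within max_dist edits
find_near_dups <- function(keys, max_dist = 2) {
  d <- adist(keys)   # pairwise generalized Levenshtein distances
  diag(d) <- NA      # ignore each key's match with itself
  vapply(seq_along(keys), function(i) {
    hits <- which(d[i, ] <= max_dist)
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
  }, character(1))
}

df3 <- data.frame(
  id = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)
df3$match <- paste0(trimws(df3$FNAME), trimws(df3$LNAME))
df3$possibleDups <- find_near_dups(df3$match, max_dist = 2)
print(df3)
```

The distance matrix grows with the square of the number of rows, so this form is best suited to small or pre-grouped datasets.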