Question

I have a dirty-data problem.

I have a list of about 600,000 names.

An example:

John Doe, 
JohnDoe, 
JohnDoe2, 
JohnDoe 84302,

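I would like to use Python or R to iterate over this list, take the closest matching record above each entry (using a probability of closest match), and replace the current record with it, so that the list above would look like: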

Iteration 1:
John Doe,
John Doe,
JohnDoe2,
JohnDoe 84302,

Iteration 2:
John Doe,
John Doe,
John Doe,
JohnDoe 84302,

Iteration 3:
John Doe,
John Doe,
John Doe,
John Doe,

Any help is much appreciated.

Thanks

Answer 1

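I agree with what @Martin Sand Christensen told you, but if you don't know how to do that, you can develop it yourself.

Given that at each iteration you need to find the name closest to the first one, the approach is as follows: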

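We define a character vector. Then we iterate over all the other names. If we detect a name that contains both "John" and "Doe", we set it to "John Doe". If a name contains only "John" or only "Doe", we leave it untouched.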
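The rule described above could be sketched in Python roughly as follows. This is only an illustration, not the answerer's original code: the function name `normalize_names`, the `canonical` parameter, and the case-insensitive substring test are all assumptions made for the example.

```python
def normalize_names(names, canonical="John Doe"):
    """Replace every name that contains all parts of `canonical` with `canonical`.

    Hypothetical helper: matching is a case-insensitive substring test
    on each whitespace-separated part of the canonical name.
    """
    parts = canonical.lower().split()  # e.g. ["john", "doe"]
    cleaned = []
    for name in names:
        lowered = name.lower()
        if all(part in lowered for part in parts):
            # Both "john" and "doe" occur: rewrite to the canonical form.
            cleaned.append(canonical)
        else:
            # Only one part (or none) occurs: leave the record untouched.
            cleaned.append(name)
    return cleaned


names = ["John Doe", "JohnDoe", "JohnDoe2", "JohnDoe 84302", "John Smith", "Jane Doe"]
print(normalize_names(names))
```

Note that a plain substring test like this is crude: it would also rewrite a record such as "Johnathan Doerr", so on a list of 600,000 names it is probably best treated as a first pass.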

Answer 2

If you want approximate matching, the soundex algorithm may do the job.

First, create a character vector. I have added a few names to your example.

x <- scan(what = character(), text = "
John Doe, 
JohnDoe, 
JohnDoe2, 
JohnDoe 84302,
Punit Patel,
PunitPatel,
Punit_Patel,
PunitPatel2
", sep = ",")

x <- x[trimws(x) != ""]

Now, several packages implement soundex, and stringdist is one of them. You need to run install.packages("stringdist") first.

library(stringdist)

phonetic(x)    # Run this if you want to see the soundex codes

ave(x, phonetic(x), FUN = function(.x) .x[1])
#[1] "John Doe"    "John Doe"    "John Doe"    "John Doe"    "Punit Patel"
#[6] "Punit Patel" "Punit Patel" "Punit Patel"
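Since the question also mentions Python, the same idea can be sketched there without third-party packages. The code below is a minimal hand-written version of American Soundex, not the stringdist implementation; the names `soundex` and `replace_with_first_in_group` are hypothetical, and dropping non-letter characters (spaces, digits, underscores) before encoding is an assumption of this sketch.

```python
# Letter-to-digit table for American Soundex.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code, ignoring non-letters (assumption)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


def replace_with_first_in_group(names):
    """Replace each name with the first name seen that shares its Soundex code."""
    first_seen = {}
    result = []
    for name in names:
        key = soundex(name)
        first_seen.setdefault(key, name)
        result.append(first_seen[key])
    return result


names = ["John Doe", "JohnDoe", "JohnDoe2", "JohnDoe 84302",
         "Punit Patel", "PunitPatel", "Punit_Patel", "PunitPatel2"]
print(replace_with_first_in_group(names))
```

Like the `ave()` call above, this keeps the first spelling encountered in each group, so the order of the list decides which variant wins.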

Data cleaning, list iteration and closest match, Python or R
