R: pmatch: 'TERESA DEL CA' with 'TERESA DEL#CARMEN'

Asked: 2016-06-21 17:45:54

Tags: r dataframe string-matching bigdata

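I have two data frames containing nearly identical data:

Test.Takers, with 29,260 observations and the following column names:

Paternal.Name, Maternal.Name, First.Name, Application.Number

Every.Student.In.The.Country, with 12,000,000 observations and the following column names:

Paternal.Name, Maternal.Name, First.Name, Application.Number

Test.Takers$Application.Number is filled with NA values, and I would like to fill it in with the application numbers from Every.Student.In.The.Country.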

I tried to do this by subsetting Every.Student.In.The.Country by Paternal.Name and Maternal.Name. I would then fill in Test.Takers$Application.Number with the following code:

Test.Takers$Application.Number[i] <- subset$Application.Number[pmatch(as.character(Test.Takers$First.Name[i]), subset$First.Name)]

This fills in roughly two thirds of Test.Takers$Application.Number. While trying to work out why so many values were still NA, I discovered that some of the names in Every.Student.In.The.Country$First.Name contain a '#'. I think the '#' throws off the pmatch function, so that a name from Test.Takers$First.Name such as 'TERESA DEL CA' is not matched with a name in Every.Student.In.The.Country$First.Name such as 'TERESA DEL#CARMEN'.
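The mismatch described above can be reproduced in isolation. The snippet below is only a small sketch using the two example strings from this question; the variable names (`x`, `raw`, `cleaned`, `no.hash.fix`, `with.hash.fix`) are made up for illustration, and it assumes the '#' stands in for a space.

```r
x   <- "TERESA DEL CA"
raw <- "TERESA DEL#CARMEN"

# pmatch() only succeeds when x is an exact match or a leading substring
# of a table entry. Character 11 is " " in x but "#" in raw, so x is not
# a leading substring of raw and the result is NA.
no.hash.fix <- pmatch(x, raw)

# Assumption: '#' stands in for a space. fixed = TRUE treats "#" literally.
cleaned <- gsub("#", " ", raw, fixed = TRUE)   # "TERESA DEL CARMEN"

# Now x is a leading substring of the cleaned entry, so pmatch() returns 1.
with.hash.fix <- pmatch(x, cleaned)
```

So the '#' itself is not special to pmatch; it simply makes the two strings differ at that position.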

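Any suggestions on how to solve this would be great. I have a feeling that something like a regular-expression function could help, but I'm not quite sure.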

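Edit: Here is some sample code that reproduces the problem. Keep in mind that the real data I am working with is very large: roughly 30,000 and 12,000,000 observations respectively. If you look at this code and spot any inefficiencies, please let me know.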

Test.Takers <- data.frame(
    Paternal.Name = c('Last', 'Last', 'Paternal'),
    Maternal.Name = c('Maternal', 'Last', 'Last'),
    First.Name = c('First', 'Name', 'TERESA DEL CA'),
    Application.Number = NA)

Every.Student.In.The.Country <- data.frame(
    Paternal.Name = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal'),
    Maternal.Name = c('Maternal', 'Last', 'Maternal', 'Last', 'Maternal', 'Last'),
    First.Name = c('First', 'Name', 'Whatever', 'TERESA DEL#CARMEN', 'Another', 'Something Else'),
    Application.Number = c(123, 456, 789, 234, 567, 890)
)

# Placeholder that holds the subset of rows for the currently selected paternal last name
indexp <- data.frame(Paternal.Name = 'name')

for (i in 1:nrow(Test.Takers)) {
    namep <- as.character(Test.Takers$Paternal.Name[i])

    # Re-subset only when the paternal last name changes, to avoid subsetting unnecessarily
    if (is.na(indexp$Paternal.Name[1]) | as.character(indexp$Paternal.Name[1]) != namep) {
        indexp <- subset(Every.Student.In.The.Country, Paternal.Name == namep)
    }

    # This check prevents the error that arises when a paternal last name
    # does not exist in Every.Student.In.The.Country
    if (!is.na(indexp$Paternal.Name[1])) {

        # Narrow the paternal last name subset down by maternal last name
        indexm <- subset(indexp, Maternal.Name == as.character(Test.Takers$Maternal.Name[i]))

        # Partial string match to find an exact or similar first name within the
        # selected last-name subset; attaches an Application.Number if a match is found
        Test.Takers$Application.Number[i] <- indexm$Application.Number[pmatch(as.character(Test.Takers$First.Name[i]), indexm$First.Name)]
    }
}

1 Answer:

Answer 0: (score: 1)

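If the # symbol is the only problem, you can add duplicates.ok = TRUE to the pmatch call. Note that duplicates.ok only controls whether an entry of the table may be matched more than once, so it may not resolve the '#' mismatch on its own:

Test.Takers$Application.Number[i] <- subset$Application.Number[pmatch(as.character(Test.Takers$First.Name[i]), subset$First.Name, duplicates.ok = TRUE)]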
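To see what duplicates.ok changes, here is a small sketch. The first two calls follow the example in the pmatch help page; the variable names (`lookup`, `once`, `reuse`, `hash`) are made up for illustration.

```r
lookup <- c("abc", "ab")

# duplicates.ok = FALSE: once a table entry has been matched it is
# excluded, so the second "ab" falls back to a partial match on "abc".
once <- pmatch(c("", "ab", "ab"), lookup, duplicates.ok = FALSE)   # NA 2 1

# duplicates.ok = TRUE: the same table entry may be matched repeatedly.
reuse <- pmatch(c("", "ab", "ab"), lookup, duplicates.ok = TRUE)   # NA 2 2

# The '#' case is a plain mismatch in the middle of the string, so the
# setting makes no difference here and the result is still NA.
hash <- pmatch("TERESA DEL CA", "TERESA DEL#CARMEN", duplicates.ok = TRUE)
```

In other words, duplicates.ok matters when several lookups compete for the same table entry, which is a different situation from the '#' mismatch in the question.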

Or you can remove the # symbol by replacing it with a space before matching:

Test.Takers$Application.Number[i] <- subset$Application.Number[pmatch(as.character(Test.Takers$First.Name[i]), gsub("#", " ", subset$First.Name))]
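Putting the gsub idea together with the sample data from the question, a minimal sketch might look like the following. It is one possible arrangement, not a tuned solution: the column `First.Clean` and the helpers `key` and `rows.by.key` are hypothetical names introduced here, and it assumes '#' stands in for a space and that the names never contain a tab character (used as the key separator). Cleaning the large table once and grouping row indices by last-name pair avoids calling subset() inside the loop, which may also help with the efficiency concern.

```r
# Sample data shaped like the question's example (strings kept as character)
Test.Takers <- data.frame(
    Paternal.Name = c("Last", "Last", "Paternal"),
    Maternal.Name = c("Maternal", "Last", "Last"),
    First.Name = c("First", "Name", "TERESA DEL CA"),
    Application.Number = NA_real_,
    stringsAsFactors = FALSE)

Every.Student.In.The.Country <- data.frame(
    Paternal.Name = c("Last", "Last", "Last", "Paternal", "Paternal", "Paternal"),
    Maternal.Name = c("Maternal", "Last", "Maternal", "Last", "Maternal", "Last"),
    First.Name = c("First", "Name", "Whatever", "TERESA DEL#CARMEN",
                   "Another", "Something Else"),
    Application.Number = c(123, 456, 789, 234, 567, 890),
    stringsAsFactors = FALSE)

# Clean the big table once, outside the loop (assumption: '#' means a space)
Every.Student.In.The.Country$First.Clean <-
    gsub("#", " ", Every.Student.In.The.Country$First.Name, fixed = TRUE)

# Group row indices by (paternal, maternal) pair so each lookup is cheap
key <- paste(Every.Student.In.The.Country$Paternal.Name,
             Every.Student.In.The.Country$Maternal.Name, sep = "\t")
rows.by.key <- split(seq_along(key), key)

for (i in seq_len(nrow(Test.Takers))) {
    k <- paste(Test.Takers$Paternal.Name[i], Test.Takers$Maternal.Name[i], sep = "\t")
    candidates <- rows.by.key[[k]]
    if (is.null(candidates)) next   # no student with this last-name pair

    # Partial match of the first name among the candidates only
    hit <- pmatch(Test.Takers$First.Name[i],
                  Every.Student.In.The.Country$First.Clean[candidates])

    # hit is NA when there is no unique match, which leaves the number as NA
    Test.Takers$Application.Number[i] <-
        Every.Student.In.The.Country$Application.Number[candidates[hit]]
}
```

On the sample data this fills Application.Number with 123, 456 and 234. Whether building `rows.by.key` for 12,000,000 rows fits comfortably in memory on a given machine is something that would need to be checked on the real data.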