Run a function on columns and keep the rows

Time: 2016-04-20 10:55:58

Tags: r function

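I have two data.frames (df1 and df2). Each df contains three columns. I want to find perfect and partial matches between df1$This and df2$That. The problem is that both columns contain duplicates, so I created df1$id and df2$ids to give every row a unique identifier.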

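To solve the first part of the problem, I first collect the strings that match perfectly and then obtain the partial matches by computing the Levenshtein distance. I found a function here and adapted it to my data and the output I need.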

signature = function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}


partialMatch = function(x,y,levDist=0.05){
  # build one normalised signature per input string
  xx=data.frame(sig=sapply(x, signature),row.names=NULL)
  yy=data.frame(sig=sapply(y, signature),row.names=NULL)
  xx$raw=x
  yy$raw=y
  xx=subset(xx,subset=(sig!=''))
  # full outer join on the signature; rows present on both sides are perfect matches
  xy=merge(xx,yy,by='sig',all=TRUE)
  matched=subset(xy,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
  matched$type="Perfect"
  # x strings without a perfect match are retried with approximate matching
  todo=subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x))
  colnames(todo)=c('sig','raw')
  todo$partials= as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=TRUE))
  todo=merge(todo,yy,by.x='partials',by.y='sig',all=TRUE)
  partial.matched=subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))),select=c("sig","raw.x", "raw.y"))
  partial.matched$type="Partial"
  # stack both kinds of match and keep only the reported columns
  matched=rbind(matched,partial.matched)
  matched=subset(matched,select=c("raw.x", "raw.y","type"))

  return(matched)
}
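
To make the two stages easier to follow, here is a small sketch on toy strings. It reuses the signature helper from above; the candidates vector and the other variable names are made up for illustration. It shows that strings differing only in spacing or token order collapse to the same signature, and that agrep() is then used for the leftovers.

```r
# Same helper as above: lower-case, split on spaces, sort the tokens, glue them together
signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

# "ab c", "abc" and "c ab" end up with the same signature, so they count as perfect matches
sigs <- sapply(c("ab c", "abc", "c ab", "a d d"), signature, USE.NAMES = FALSE)

# Stage 2: approximate matching of a leftover signature against candidate signatures
# (candidates is an illustrative vector, not taken from the real data)
candidates <- c("abc", "acd", "aff")
partial <- agrep("add", candidates, max.distance = 0.05, value = TRUE)

# adist() reports the plain Levenshtein distances for comparison
distances <- adist("add", candidates)
```

Note that a max.distance below 1 is read by agrep() as a fraction of the pattern length and rounded up, so with short strings even 0.05 allows one edit.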

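I call this function with df3 <- partialMatch(df1$This, df2$That), which returns a df containing raw.x (df1$This), raw.y (df2$That) and type (Perfect/Partial). My question is whether I can call the function on these two columns but keep the data from the other columns in the final result, i.e. something like df3 <- partialMatch(df$This, df$That) that still returns every column for each row. So my data would go from df1 and df2 to df3: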

df1
This    id    detail
ab c    1     male
a c d   2     male
a d d   3     female
ab c    4     female

df2
That   ids   details
abc    x     strong
a c d  y     weak
aff    z     maniac
ab c   k     lord

df3
This  That    type    id  detail ids  details
ab c   abc    perfect 1   male   x    strong
ab c   ab c   perfect 1   male   k    lord
ab c   abc    perfect 4   female x    strong
ab c   ab c   perfect 4   female k    lord
a c d  a c d  perfect 2   male   y    weak

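Or do I have to edit the function itself? Because of the duplicates, I cannot simply merge the other columns back in after running the function.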
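
A small sketch of why merging afterwards goes wrong: joining the match result back to df1 by string value pairs every matched row with every df1 row that holds that string. The matched frame below is typed in by hand to mimic the five perfect matches and is only an illustration.

```r
df1 <- data.frame(This = c("ab c", "a c d", "a d d", "ab c"),
                  id = 1:4,
                  detail = c("male", "male", "female", "female"),
                  stringsAsFactors = FALSE)

# Hand-written stand-in for the perfect matches returned by partialMatch()
matched <- data.frame(raw.x = c("ab c", "ab c", "ab c", "ab c", "a c d"),
                      raw.y = c("abc", "ab c", "abc", "ab c", "a c d"),
                      stringsAsFactors = FALSE)

# Joining on the string value: each "ab c" row meets both df1 rows with that value
back <- merge(matched, df1, by.x = "raw.x", by.y = "This")

n_matched <- nrow(matched)  # 5 rows before the merge
n_back    <- nrow(back)     # 9 rows after it (4 * 2 for "ab c", 1 for "a c d")
```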

Maybe apply is the right approach? I tried the following, which obviously did not work. I am posting it as inspiration.

apply(df1[,c('id','detail')], 1, partialMatch(df1$This, df2$That))
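
A likely reason this call fails: the third argument of apply() has to be a function, but partialMatch(df1$This, df2$That) is evaluated first and hands apply() a data frame instead. The toy sketch below uses a hypothetical stand-in, make_df, in place of the real function.

```r
m <- matrix(1:4, nrow = 2)

# Hypothetical stand-in that, like partialMatch(), returns a data frame
make_df <- function() data.frame(a = 1)

# Passing the *result* of a call as FUN: apply() rejects it with an error
res_bad <- tryCatch(apply(m, 1, make_df()), error = function(e) "error")

# Passing a function: apply() calls it once per row
res_ok <- apply(m, 1, function(row) sum(row))
```

Even with a proper function, apply() would only see one row of id and detail at a time, so it would not by itself attach those columns to the matches.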

1 Answer:

Answer 0 (score: 1)

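Note: I am not sure this works 100%, but it should give you an idea of what needs to change so that you can integrate whatever functionality you want.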

It is just a matter of rewriting the function:

partialMatch = function(df1, col1, df2, col2, levDist=0.05){
  x <- df1[,col1]  
  y <- df2[,col2]
  xx <- data.frame(df1, 
                   sig = gsub("[[:space:]]", "", x),
                   raw = x,
                   row.names=NULL)
  yy <- data.frame(df2, 
                   sig = gsub("[[:space:]]", "", y),
                   raw = y,
                   row.names=NULL)
  xx <- subset(xx,subset=(sig!=''))
  xy <- merge(xx,yy,by='sig',all=T)
  xy$type <- NA
  xy$type[!(is.na(xy$raw.x) | is.na(xy$raw.y))] <- "Perfect"
  todo <- subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x))
  colnames(todo) <- c('sig','raw')
  todo$partials <- as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=T))
  if (nrow(todo) > 0) {
    todo <- merge(todo,xy,by.x='partials',by.y='sig',all=T)
    partial.matched <- subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
    partial.matched$type <- "Partial"
    partial.matched[,col2] <- partial.matched$raw
    partial.matched <- partial.matched[,!names(partial.matched) %in% c("raw","partials")]
    xy <- rbind(xy,partial.matched)
  }
  xy <- subset(xy,select=c(col1,col2,"type","id","detail","ids","details"))
  xy <- xy[!(is.na(xy[,col1]) | is.na(xy[,col2])),]
  rownames(xy) <- NULL
  return(xy)

}

partialMatch(df1, "This", df2, "That")
#    This  That    type id detail ids details
# 1  ab c   abc Perfect  1   male   x  strong
# 2  ab c  ab c Perfect  1   male   k    lord
# 3  ab c   abc Perfect  4 female   x  strong
# 4  ab c  ab c Perfect  4 female   k    lord
# 5 a c d a c d Perfect  2   male   y    weak
# 6 a c d a d d Partial  2   male   y    weak

Edit: fixed the case where there are no perfect or partial matches.
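
Note that row 6 of the printed result shows "a d d" under That, although that string occurs only in df1$This. As a possible alternative, and not the function from this answer, the sketch below matches row indices instead of string values, so duplicates stay distinct and every column of both data frames comes along. The name matchRows and its internals are assumptions made for illustration; signature is the helper from the question.

```r
df1 <- data.frame(This = c("ab c", "a c d", "a d d", "ab c"), id = 1:4,
                  detail = c("male", "male", "female", "female"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(That = c("abc", "a c d", "aff", "ab c"),
                  ids = c("x", "y", "z", "k"),
                  details = c("strong", "weak", "maniac", "lord"),
                  stringsAsFactors = FALSE)

signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

# Hypothetical helper: returns one output row per matching (df1 row, df2 row) pair
matchRows <- function(df1, col1, df2, col2, levDist = 0.05) {
  sig1 <- vapply(df1[[col1]], signature, character(1), USE.NAMES = FALSE)
  sig2 <- vapply(df2[[col2]], signature, character(1), USE.NAMES = FALSE)

  pairs <- lapply(seq_along(sig1), function(i) {
    if (sig1[i] == "") return(NULL)            # skip empty strings, as in the question
    j <- which(sig2 == sig1[i])                # exact signature matches
    type <- "Perfect"
    if (length(j) == 0) {                      # no exact match: try approximate matching
      j <- agrep(sig1[i], sig2, max.distance = levDist)
      type <- "Partial"
    }
    if (length(j) == 0) return(NULL)
    data.frame(i = i, j = j, type = type, stringsAsFactors = FALSE)
  })
  pairs <- Filter(Negate(is.null), pairs)
  if (length(pairs) == 0) return(NULL)         # nothing matched at all
  pairs <- do.call(rbind, pairs)

  # Index both data frames by row number, so all of their columns are kept
  out <- data.frame(df1[pairs$i, , drop = FALSE],
                    df2[pairs$j, , drop = FALSE],
                    type = pairs$type,
                    stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

res <- matchRows(df1, "This", df2, "That")
```

On the sample data this should give the five perfect rows from the question plus a partial row that pairs "a d d" (id 3) with "a c d" (ids y). The rows come out in df1 order rather than in the order shown in the question.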