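I have two data.frames (df1 and df2). Each df has three columns. I want to use df1$This to find perfect and partial matches in df2$That. The problem is that both columns contain duplicates, so I created df1$id and df2$ids to give every row a unique identifier.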
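To solve my first problem, I first collect the perfectly matching strings and then get the partial matches by computing the Levenshtein distance. I found a function here and adapted it to my data and the output I need.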
signature = function(x){
  # Order-insensitive key: lower-case, split on spaces, sort the tokens, concatenate
  sig = paste(sort(unlist(strsplit(tolower(x), " "))), collapse = '')
  return(sig)
}
partialMatch = function(x, y, levDist = 0.05){
  xx = data.frame(sig = sapply(x, signature), row.names = NULL)
  yy = data.frame(sig = sapply(y, signature), row.names = NULL)
  xx$raw = x
  yy$raw = y
  xx = subset(xx, subset = (sig != ''))
  # Full outer join on the signature; rows present on both sides are perfect matches
  xy = merge(xx, yy, by = 'sig', all = TRUE)
  matched = subset(xy, subset = (!(is.na(raw.x)) & !(is.na(raw.y))))
  matched$type = "Perfect"
  # Strings from x without a perfect match are tried with approximate matching
  todo = subset(xy, subset = (is.na(raw.y)), select = c(sig, raw.x))
  colnames(todo) = c('sig', 'raw')
  todo$partials = as.character(sapply(todo$sig, agrep, yy$sig, max.distance = levDist, value = TRUE))
  todo = merge(todo, yy, by.x = 'partials', by.y = 'sig', all = TRUE)
  partial.matched = subset(todo, subset = (!(is.na(raw.x)) & !(is.na(raw.y))), select = c("sig", "raw.x", "raw.y"))
  partial.matched$type = "Partial"
  matched = rbind(matched, partial.matched)
  matched = subset(matched, select = c("raw.x", "raw.y", "type"))
  return(matched)
}
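To make the two building blocks of this function easier to follow, here is a minimal sketch that runs them in isolation. The input strings and the variable names `sigs`, `candidates` and `hits` are illustrative only. It relies on the documented behaviour of agrep, where a `max.distance` below 1 is read as a fraction of the pattern length and rounded up to a whole number of edits.

```r
# Same signature helper as above: lower-case, split on spaces,
# sort the tokens and concatenate them.
signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

# "ab c" and "c ab" collapse to the same key because the tokens are sorted.
sigs <- sapply(c("ab c", "c ab", "a c d"), signature, USE.NAMES = FALSE)
sigs
# "abc" "abc" "acd"

# Approximate matching: 0.05 * nchar("add") = 0.15, rounded up to 1 edit,
# so only candidates within one edit of "add" are returned.
candidates <- c("abc", "acd", "aff")
hits <- agrep("add", candidates, max.distance = 0.05, value = TRUE)
hits
# "acd"
```

With such short strings even the small default of 0.05 still allows one full edit, which is why "a d d" ends up as a partial match for "a c d".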
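I call this function with df3 <- partialMatch(df1$This, df2$That), which produces a df containing raw.x (df1$This), raw.y (df2$That) and type (Perfect/Partial). My question is whether I can call the function on these two columns while keeping the data from the other columns in the final result, something like df3 <- partialMatch(df$This, df$That) that returns every column for every row. So my data would go from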
df1
This   id  detail
ab c   1   male
a c d  2   male
a d d  3   female
ab c   4   female
df2
That   ids  details
abc    x    strong
a c d  y    weak
aff    z    maniac
ab c   k    lord
to
df3
This   That   type     id  detail  ids  details
ab c   abc    perfect  1   male    x    strong
ab c   ab c   perfect  1   male    k    lord
ab c   abc    perfect  4   female  x    strong
ab c   ab c   perfect  4   female  k    lord
a c d  a c d  perfect  2   male    y    weak
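Or do I have to edit inside the function? Because of the duplicates, I cannot merge after running the function.
Maybe apply is the right approach? I tried the following, which obviously did not work. I am posting it for inspiration.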
apply(df1[,c('id','detail')], 1, partialMatch(df1$This, df2$That))
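One reason the call above cannot work is that apply expects a function as its third argument, whereas partialMatch(df1$This, df2$That) is evaluated on the spot and hands apply a data.frame. As a rough sketch of a different direction, duplicates do not have to block a merge when the merge key is the signature itself: merge pairs every row on one side with every row on the other side that has the same key, and carries the remaining columns along. The helper name `key` and the column name `sig` are assumptions made for this illustration, and only perfect matches are covered.

```r
# Sample data copied from the question.
df1 <- data.frame(This = c("ab c", "a c d", "a d d", "ab c"),
                  id = 1:4,
                  detail = c("male", "male", "female", "female"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(That = c("abc", "a c d", "aff", "ab c"),
                  ids = c("x", "y", "z", "k"),
                  details = c("strong", "weak", "maniac", "lord"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: strip spaces so that "ab c" and "abc" share one key.
key <- function(x) gsub(" ", "", x, fixed = TRUE)

df1$sig <- key(df1$This)
df2$sig <- key(df2$That)

# Inner join on the key: duplicated keys yield every pairwise combination,
# and id, detail, ids and details stay attached to their own rows.
perfect <- merge(df1, df2, by = "sig")
perfect[, c("This", "That", "id", "detail", "ids", "details")]
```

On the sample data this gives the five perfect-match rows listed in the desired df3, although the row order may differ.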
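Answer 0 (score: 1)
Note: I am not sure this works 100%, but it should give you an idea of what needs to change so that you can integrate whatever functionality you want.
It is just a matter of rewriting the function: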
partialMatch = function(df1, col1, df2, col2, levDist=0.05){
  x <- df1[,col1]
  y <- df2[,col2]
  # Keep all columns of each input; add a whitespace-free signature and the raw string
  xx <- data.frame(df1,
                   sig = gsub("[[:space:]]", "", x),
                   raw = x,
                   row.names=NULL)
  yy <- data.frame(df2,
                   sig = gsub("[[:space:]]", "", y),
                   raw = y,
                   row.names=NULL)
  xx <- subset(xx,subset=(sig!=''))
  # Full outer join on the signature; rows present on both sides are perfect matches
  xy <- merge(xx,yy,by='sig',all=TRUE)
  xy$type <- NA
  xy$type[!(is.na(xy$raw.x) | is.na(xy$raw.y))] <- "Perfect"
  # df1 strings without a perfect match are tried with approximate matching
  todo <- subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x))
  colnames(todo) <- c('sig','raw')
  todo$partials <- as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=TRUE))
  if (nrow(todo) > 0) {
    todo <- merge(todo,xy,by.x='partials',by.y='sig',all=TRUE)
    partial.matched <- subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
    partial.matched$type <- "Partial"
    # Put the partially matched string (todo's raw) into col2
    partial.matched[,col2] <- partial.matched$raw
    partial.matched <- partial.matched[,!names(partial.matched) %in% c("raw","partials")]
    xy <- rbind(xy,partial.matched)
  }
  # Keep the requested columns and drop rows that lack a value in col1 or col2
  xy <- subset(xy,select=c(col1,col2,"type","id","detail","ids","details"))
  xy <- xy[!(is.na(xy[,col1]) | is.na(xy[,col2])),]
  rownames(xy) <- NULL
  return(xy)
}
partialMatch(df1, "This", df2, "That")
#    This   That   type     id  detail  ids  details
# 1  ab c   abc    Perfect  1   male    x    strong
# 2  ab c   ab c   Perfect  1   male    k    lord
# 3  ab c   abc    Perfect  4   female  x    strong
# 4  ab c   ab c   Perfect  4   female  k    lord
# 5  a c d  a c d  Perfect  2   male    y    weak
# 6  a c d  a d d  Partial  2   male    y    weak
Edit: fixed the case where there are no perfect or no partial matches.
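In the output above, row 6 shows the partial pair with id 2 and detail male, whereas in df1 the string "a d d" belongs to id 3 and detail female. The following is a hedged sketch of one possible variation, not the function above: it asks agrep for row positions instead of values, so each partial match keeps the columns of its own df1 row. The names `make_sig` and `match_keep_cols` are hypothetical, and the sketch assumes that df1 and df2 share no column names.

```r
# Sample data copied from the question.
df1 <- data.frame(This = c("ab c", "a c d", "a d d", "ab c"),
                  id = 1:4,
                  detail = c("male", "male", "female", "female"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(That = c("abc", "a c d", "aff", "ab c"),
                  ids = c("x", "y", "z", "k"),
                  details = c("strong", "weak", "maniac", "lord"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: whitespace-free, lower-case key.
make_sig <- function(x) gsub("[[:space:]]", "", tolower(x))

# Hypothetical function; assumes df1 and df2 have no column names in common.
match_keep_cols <- function(df1, col1, df2, col2, levDist = 0.05) {
  df1$sig <- make_sig(df1[[col1]])
  df2$sig <- make_sig(df2[[col2]])
  df1 <- df1[nzchar(df1$sig), , drop = FALSE]   # agrep needs a non-empty pattern

  # Perfect matches: inner join on the key, all other columns carried along.
  perfect <- merge(df1, df2, by = "sig")
  perfect$type <- rep("Perfect", nrow(perfect))

  # Partial matches: for each df1 row without a perfect match, agrep returns
  # the positions of the approximately matching df2 rows.
  rest <- df1[!df1$sig %in% df2$sig, , drop = FALSE]
  df2_cols <- setdiff(names(df2), "sig")
  pairs <- lapply(seq_len(nrow(rest)), function(i) {
    hits <- agrep(rest$sig[i], df2$sig, max.distance = levDist)
    if (length(hits) == 0) return(NULL)
    data.frame(rest[rep(i, length(hits)), , drop = FALSE],
               df2[hits, df2_cols, drop = FALSE],
               type = "Partial", row.names = NULL, stringsAsFactors = FALSE)
  })
  partial <- do.call(rbind, pairs)   # NULL when nothing matched partially

  out <- if (is.null(partial)) perfect else rbind(perfect, partial)
  other <- setdiff(names(out), c(col1, col2, "type", "sig"))
  out <- out[, c(col1, col2, "type", other), drop = FALSE]
  rownames(out) <- NULL
  out
}

res <- match_keep_cols(df1, "This", df2, "That")
res
```

On the sample data this should give the same five Perfect rows plus one Partial row that pairs "a d d" (id 3, female) with "a c d" (ids y, weak). A df1 string that is within the allowed distance of several df2 strings produces one row per hit.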