我有两个数据集
datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
#-----------
name number1
1 regular 1
2 kklmin 8
3 notSo 9
4 Jijoh 2
5 Kish 18
6 Lissp 25
7 Kcn 33
8 CCCa 8
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
"KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
#-------------
name number2
1 reGulr 2
2 ntSo 8
3 Jijoh 12
4 sean 13
5 LiSsp 20
6 KcN 18
7 CaPN 13
我想通过名称列合并它们,但允许部分匹配(以避免在大型数据集中合并拼写错误,甚至检测此类拼写错误),例如
(1)如果在任何位置连续四个字母(全部,如果字母数小于4) - 匹配就好了
ABBCD = BBCDK = aBBCD = ramABBBCD = ABB
(2)比赛中的区分大小写已关闭,例如ABBCD = aBbCd
(3)新数据集将保留两个名称(来自datf1和datf2的名称)。所以这封信我们可以检测到匹配是否完美(可能是一个单独的列,其中有多少个字母匹配)
这种合并是否可能?
编辑:
datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean",
"LiSsp", "KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular <NA> 1 NA 0
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> reGulr NA 2 0
10 <NA> ntSo NA 8 0
11 <NA> sean NA 13 0
12 <NA> CaPN NA 13 0
答案 0 :(得分:7)
也许有一个简单的解决方案,但我找不到任何
恕我直言,你必须为自己实现这种合并。
请在下面找到一个丑陋的例子(有很多改进空间):
uglyMerge <- function(df1, df2) {
## lower all strings to allow case-insensitive comparison
lowerNames1 <- tolower(df1[, 1]);
lowerNames2 <- tolower(df2[, 1]);
## split strings into single characters
names1 <- strsplit(lowerNames1, "");
names2 <- strsplit(lowerNames2, "");
## create the final dataframe
mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA,
number1=df1[,2], number2=NA, matches=0,
stringsAsFactors=FALSE);
## store names of dataframe2 (to remember which strings have no match)
toMerge <- df2[, 1];
for (i in seq(along=names1)) {
for (j in seq(along=names2)) {
## set minimal match to 4 or to string length
minMatch <- min(4, length(names2[[j]]));
## find single matches
matches <- names1[[i]] %in% names2[[j]];
## look for consecutive matches
r <- rle(matches);
## any matches found?
if (any(r$values)) {
## find max consecutive match
possibleMatch <- r$value == TRUE;
maxPos <- which(which.max(r$length[possibleMatch]) & possibleMatch)[1];
## store max conscutive match length
maxMatch <- r$length[maxPos];
## to remove FALSE-POSITIVES (e.g. CCC and kcn) find
## largest substring
start <- sum(r$length[0:(maxPos-1)]) + 1;
stop <- start + r$length[maxPos] - 1;
maxSubStr <- substr(lowerNames1[i], start, stop);
## all matching criteria fulfilled
isConsecutiveMatch <- maxMatch >= minMatch &&
grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
nchar(maxSubStr) > 0;
if (isConsecutiveMatch) {
## merging
mergedDf[i, "matches"] <- maxMatch
mergedDf[i, "name2"] <- as.character(df2[j, 1]);
mergedDf[i, "number2"] <- df2[j, 2];
## don't append this row to mergedDf because already merged
toMerge[j] <- NA;
## stop inner for loop here to avoid possible second match
break;
}
}
}
}
## append not matched rows to mergedDf
toMerge <- which(df2[, 1] == toMerge);
df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]),
number1=NA, number2=df2[toMerge, 2], matches=0,
stringsAsFactors=FALSE);
mergedDf <- rbind(mergedDf, df2);
return (mergedDf);
}
输出:
> uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular reGulr 1 2 5
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> ntSo NA 8 0
10 <NA> sean NA 13 0
11 <NA> CaPN NA 13 0
答案 1 :(得分:5)
agrep
会让你开始。
类似的东西:
lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))
然后您可以调整max.distance
参数,直到获得适当的匹配量。然后你喜欢合并。