I have the data shown below in an R data frame:
DF
structure(list(ID = c("VVC-110", "VVC-111", "VVC-111", "VVC-112",
"VVC-113"), Add = c("255 3RD FLOOR A SQUARE PLOT NO 10 POCKET 4 SECTOR 11 ",
"7045 Liberty Ave. Gastonia, Rose Street ", "22 S. Holly St. \nWinter Garden,.",
"9416 Washington St. \nStafford, Leatherwood Circle", "466 Pawnee Street \nSicklerville,Ridgeview Court \nMundelein,.."
), State = c("Alabama", "Alaska", "Arizona ", "California ",
"Colorado"), City = c("Birmingham", "Anchorage", "Phoenix", "Los Angeles",
"Denver"), Zipcode = c(58765L, 75974L, 98052L, 89406L, 12421L
), Add_1 = c("255, 3rd FLOOR A SQUARE PLOT NO.10 POCKET 4 SECTOR 11, ",
"7045 Liberty Ave. Gastonia, Rose Street View, New", "22 S. Holly St. \nWinter Garden,.",
"9416, Washington St., \nStafford, Leather Wood", "466 Pawnee Street \nSicklerville"
), State_1 = c("Alabama", "Alaskaa", "Arizona", "California",
"Colorado"), City_1 = c("Birmingham", "Anchorage", "Phoenix",
"LosAngeles", "Den ver"), Zipcode_1 = c(58765L, 75974L, 98052L,
89406L, 12421L)), class = "data.frame", row.names = c(NA, -5L
))
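Using the data frame above, I want to determine the % match between two specific strings. I can ensure that both columns in each pair have the same number of rows.
The string match percentage between Add and Add_1.
The string match percentage between State and State_1.
Disclaimer: all of the % values shown in the "Required Output" data frame are random and may vary depending on the logic and method used.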
Answer 0 (score: 3)
I am using this approach to get the Levenshtein distance (with an additional suggestion from @Michael Bird):
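library(RecordLinkage)  # provides levenshteinDist()
library(dplyr)

# R is case-sensitive: the data frame in the question is named DF
DF %>%
  mutate(
    # edit distance between each pair of strings, row by row
    levi_add = levenshteinDist(Add, Add_1),
    levi_state = levenshteinDist(State, State_1),
    # length of the reference string, used to scale the distance
    n_char_add = nchar(Add),
    n_char_State = nchar(State),
    # convert the distance into a percentage match
    levi_add_percent = 100 - round(levi_add / n_char_add * 100, digits = 1),
    levi_state_percent = 100 - round(levi_state / n_char_State * 100, digits = 1)
  ) %>%
  select(ID, levi_add_percent, levi_state_percent)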
The output is:
ID levi_add_percent levi_state_percent
1 VVC-110 90.6 100.0
2 VVC-111 77.5 83.3
3 VVC-111 100.0 87.5
4 VVC-112 77.6 90.9
5 VVC-113 50.8 100.0
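
One property of the formula above is that it divides by the length of the first string only, so the result can drop below 0 when the second string is much longer than the first. As a possible variation, and not part of the original answer, the sketch below divides by the longer of the two strings, which keeps the result between 0 and 100. It assumes base R's adist() in place of levenshteinDist(), and match_percent is a hypothetical helper name introduced here purely for illustration.

```r
# Hypothetical helper (not from the original answer): percentage match
# based on edit distance, scaled by the longer of the two strings.
match_percent <- function(a, b, digits = 1) {
  # adist() is base R (utils); compare a[i] with b[i] one pair at a time
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  # Assumption: two empty strings count as a perfect match (avoids 0/0)
  ifelse(longest == 0, 100, round(100 * (1 - d / longest), digits))
}

# State columns copied from the question's data
state   <- c("Alabama", "Alaska", "Arizona ", "California ", "Colorado")
state_1 <- c("Alabama", "Alaskaa", "Arizona", "California", "Colorado")

match_percent(state, state_1)
# 100.0  85.7  87.5  90.9 100.0

# Trailing spaces count as edits; trimming them first removes that effect
match_percent(trimws(state), trimws(state_1))
# 100.0  85.7 100.0 100.0 100.0
```

Whether to trim whitespace or ignore case before comparing depends on what should count as a mismatch in the data, so treat both as choices rather than defaults.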