Computing Levenshtein / Hamming distance by a grouping variable

Asked: 2019-06-25 09:53:05

Tags: r levenshtein-distance hamming-distance stringdist

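I am trying to compute the accuracy of participants' responses (column MEM_Response) against the correct responses (column MEM_Correct). The grouping variable would be the participant ID (here, column SERIAL, with 15 cases per participant).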

dput(example)
structure(list(MEM_Correct = c("ZLHK", "RZKX", "DGWL", "BCJSP", 
"WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", "DSHRKBV", "HCXLZWB", 
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX", 
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", 
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_Response = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ", 
"CHBXS", "HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB", 
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX", 
"DGWL", "BCJSB", "WRKTJ", "CHBXA", "HDNDWX", "SWVNDT", "WLGPBD", 
"DSHKRBV", "WLGJHKK", "HDBNVZC", "BCHRKVBM", "RVGBKSNM", "NWHVZWHJ"
), SERIAL = c("4444", "4444", "4444", "4444", "4444", "4444", 
"4444", "4444", "4444", "4444", "4444", "4444", "4444", "4444", 
"4444", "5555", "5555", "5555", "5555", "5555", "5555", "5555", 
"5555", "5555", "5555", "5555", "5555", "5555", "5555", "5555"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 
26L, 27L, 28L, 29L, 30L, 31L), class = "data.frame")

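I have tried several ways of computing accuracy (i.e. the distance between the correct response and the actual response), but so far none has given me a satisfying output.

Using stringdist for Hamming and Levenshtein distance: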

Levenshtein:

library(stringdist)
example$MEM_Lev <- stringdist(example$MEM_Correct, example$MEM_Response, method = "lv")

Hamming:

example$MEM_Ham = stringdist(example$MEM_Correct, example$MEM_Response, method = c("hamming"))

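Problem: I have the Hamming distance for each case, but how would I compute accuracy per participant, ending up with a value between 0 and 1 (i.e. 0 to 100% accuracy)? Another problem with the Hamming distance is that cases of unequal length (see row 5: WRKTJ vs. WRKLTJ) yield Inf. So using the Levenshtein distance would probably be better, right?

Then I tried the Levenshtein distance using with():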

with(example, levenshteinSim(MEM_Correct, MEM_Response))

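This time the values lie between 0 and 1, which I think is a step forward. Looking at row 5 again: WRKTJ (5 letters) differs from WRKLTJ (6 letters) in that the latter has an extra "L" in the middle. So one single edit (in this case a deletion) is needed to match the correct response. Its Levenshtein similarity of 0.8333 corresponds to 5/6 correct (even though the correct string has only 5 letters). Am I using the right distance function?

Finally, my last question:

How do I match/compute the mean accuracy per participant? I also have another df containing all participants, and I would like to merge the output from the example with that data frame, which has 1 row = 1 participant.

I hope this makes sense; if not, I can try to provide more information. If you think I am not using the right approach, feel free to suggest alternatives.

Thanks in advance!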

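1 Answer:

Answer 0: (score: 0)

How you define "accuracy" is a methodological decision that is up to you; there may be references in the literature. Here is one suggestion, though.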

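library(stringdist)

# Levenshtein distance between the correct string (column 1) and the response (column 2)
example$lv.dist <- stringdist(example[,1], example[,2], method="lv")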
head(example)
#   MEM_Correct MEM_Response SERIAL lv.dist
# 1        ZLHK         ZLHK   4444       0
# 2        RZKX         RZKX   4444       0
# 3        DGWL         DGWL   4444       0
# 4       BCJSP        BCJSP   4444       0
# 5       WRKTJ       WRKLTJ   4444       1
# 6       CHBXS        CHBXS   4444       0

aggregate(lv.dist ~ SERIAL, example, mean)
#   SERIAL  lv.dist
# 1   4444 0.200000
# 2   5555 1.866667

aggregate(lv.dist ~ SERIAL, example, function(x) round(mean(100/(1+x)), 2))
#   SERIAL lv.dist
# 1   4444   92.22
# 2   5555   54.17
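
To make the 100/(1+x) score concrete, here is a small worked example for participant 4444. It assumes the distances implied by the data above: 13 exact matches, a distance of 1 in row 5 (WRKTJ vs. WRKLTJ) and a distance of 2 in row 8 (SWVNDT vs. SWVDTN), which matches the mean distance of 0.2. The variable names are only illustrative.

```r
# Assumed per-item Levenshtein distances for participant 4444
x <- c(rep(0, 13), 1, 2)

# Each item scores 100 for an exact match, 50 for one edit, 33.3 for two edits, ...
per_item <- 100 / (1 + x)

# Mean item score, rounded as in the aggregate() call above
score <- round(mean(per_item), 2)
print(score)  # 92.22
```

Note that this score penalises the first edit heavily (100 to 50) regardless of string length, which may or may not be what you want.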

# Using stringsim()
example$lv.sim <- stringsim(example[,1], example[,2], method="lv")

(agg <- aggregate(lv.sim ~ SERIAL, example, function(x) round(mean(x)*100, 2)))
#   SERIAL lv.sim
# 1   4444  96.67
# 2   5555  73.25
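
As far as I understand it, for method="lv" the similarity is 1 minus the distance divided by the length of the longer string, which is why row 5 comes out as 5/6. If you would rather scale by the length of the correct string, a rough base-R sketch could look like this. lev_sim() is a hypothetical helper, not part of stringdist; it relies on utils::adist() for the distance.

```r
# Hypothetical helper: normalised Levenshtein similarity using base R only.
# denom = "max"     divides by the longer of the two strings
# denom = "correct" divides by the length of the correct string
#                   (can drop below 0 if the response is much longer)
lev_sim <- function(correct, response, denom = c("max", "correct")) {
  denom <- match.arg(denom)
  # Pairwise Levenshtein distance, one value per row
  d <- mapply(function(a, b) adist(a, b)[1, 1], correct, response,
              USE.NAMES = FALSE)
  len <- if (denom == "max") {
    pmax(nchar(correct), nchar(response))
  } else {
    nchar(correct)
  }
  1 - d / len
}

lev_sim("WRKTJ", "WRKLTJ")                     # 1 - 1/6 = 0.8333...
lev_sim("WRKTJ", "WRKLTJ", denom = "correct")  # 1 - 1/5 = 0.8
```

Which denominator is appropriate is again a methodological choice; the sketch only shows that both are easy to compute.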

# Merging two data.frames is easy as long as they have a
# column in common (SERIAL in this case)
participants <- data.frame(age=7:9, SERIAL=c(5555, 4444, 1234))

merge(participants, agg)
#   SERIAL age lv.sim
# 1   4444   8  96.67
# 2   5555   7  73.25

merge(participants, agg, all=TRUE)
#   SERIAL age lv.sim
# 1   1234   9     NA
# 2   4444   8  96.67
# 3   5555   7  73.25
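
Since you want to end up with exactly one row per participant, a left join may be the closest fit: all.x = TRUE keeps every row of the participant table and fills NA where no score exists. The sketch below uses small made-up tables with the same values as above, and stores SERIAL as character in both so the key types match.

```r
# Illustrative data only: one row per participant, plus the aggregated scores
participants <- data.frame(age = 7:9,
                           SERIAL = c("5555", "4444", "1234"),
                           stringsAsFactors = FALSE)
agg <- data.frame(SERIAL = c("4444", "5555"),
                  lv.sim = c(96.67, 73.25),
                  stringsAsFactors = FALSE)

# Left join: keep all participants, even those without a score
merged <- merge(participants, agg, by = "SERIAL", all.x = TRUE)
print(merged)
```

Participant 1234 stays in the result with lv.sim set to NA, so the row count of the participant table is preserved.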