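I am trying to calculate the accuracy of participants' responses (column MEM_Response) relative to the correct response (column MEM_Correct). The grouping variable is the participant ID (here column SERIAL, with 15 cases per participant).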
dput(example)
structure(list(MEM_Correct = c("ZLHK", "RZKX", "DGWL", "BCJSP",
"WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", "DSHRKBV", "HCXLZWB",
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX",
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB",
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_Response = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ",
"CHBXS", "HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB",
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX",
"DGWL", "BCJSB", "WRKTJ", "CHBXA", "HDNDWX", "SWVNDT", "WLGPBD",
"DSHKRBV", "WLGJHKK", "HDBNVZC", "BCHRKVBM", "RVGBKSNM", "NWHVZWHJ"
), SERIAL = c("4444", "4444", "4444", "4444", "4444", "4444",
"4444", "4444", "4444", "4444", "4444", "4444", "4444", "4444",
"4444", "5555", "5555", "5555", "5555", "5555", "5555", "5555",
"5555", "5555", "5555", "5555", "5555", "5555", "5555", "5555"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 27L, 28L, 29L, 30L, 31L), class = "data.frame")
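I have tried several ways of calculating accuracy (that is, the distance between the correct response and the actual response), but so far none has given me a satisfactory result.
Using stringdist for the Hamming and Levenshtein distances:
Levenshtein: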
library(stringdist)
example$MEM_Lev <- stringdist(example$MEM_Correct, example$MEM_Response, method = "lv")
Hamming:
example$MEM_Ham = stringdist(example$MEM_Correct, example$MEM_Response, method = c("hamming"))
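Question: I now have a Hamming distance for every case, but how do I turn that into an accuracy per participant that ends up in the range 0 to 1 (that is, 0 to 100% accuracy)? The Hamming distance also has the problem that cases of unequal length (see row 5: WRKTJ vs. WRKLTJ) yield Inf. So using the Levenshtein distance is probably the better choice, right?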
Then I tried levenshteinSim() inside with():
library(RecordLinkage)  # provides levenshteinSim()
with(example, levenshteinSim(MEM_Correct, MEM_Response))
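This time the values lie between 0 and 1, which I think is a step forward. Back to row 5: WRKTJ (5 letters) differs from WRKLTJ (6 letters) in that the latter has an extra "L" in the middle. So a single edit (here a deletion) is needed to match the correct response. Its similarity of 0.8333 corresponds to 5/6 correct, even though the correct string has only 5 letters. Am I using the right distance function?
And my last question:
How do I compute the mean accuracy per participant? I also have a df containing all participants, and I would like to merge the output from the example into that data frame, where 1 row = 1 participant.
I hope this makes sense. If not, I can try to provide more information. If you think I am not using the right approach, feel free to suggest another one.
Thanks in advance!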
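Answer 0 (score: 0)
How you define "accuracy" is a methodological decision that you have to make yourself. The literature may offer some guidance, but here is one suggestion.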
library(stringdist)
example$lv.dist <- stringdist(example[,1], example[,2], method="lv")
head(example)
# MEM_Correct MEM_Response SERIAL lv.dist
# 1 ZLHK ZLHK 4444 0
# 2 RZKX RZKX 4444 0
# 3 DGWL DGWL 4444 0
# 4 BCJSP BCJSP 4444 0
# 5 WRKTJ WRKLTJ 4444 1
# 6 CHBXS CHBXS 4444 0
aggregate(lv.dist ~ SERIAL, example, mean)
# SERIAL lv.dist
# 1 4444 0.200000
# 2 5555 1.866667
aggregate(lv.dist ~ SERIAL, example, function(x) round(mean(100/(1+x)), 2))
# SERIAL lv.dist
# 1 4444 92.22
# 2 5555 54.17
# Using stringsim()
example$lv.sim <- stringsim(example[,1], example[,2], method="lv")
(agg <- aggregate(lv.sim ~ SERIAL, example, function(x) round(mean(x)*100, 2)))
# SERIAL lv.sim
# 1 4444 96.67
# 2 5555 73.25
# Merging two data.frames is easy as long as they have a
# column in common (SERIAL in this case)
participants <- data.frame(age=7:9, SERIAL=c(5555, 4444, 1234))
merge(participants, agg)
# SERIAL age lv.sim
# 1 4444 8 96.67
# 2 5555 7 73.25
merge(participants, agg, all=TRUE)
# SERIAL age lv.sim
# 1 1234 9 NA
# 2 4444 8 96.67
# 3 5555 7 73.25
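
The question also asks why WRKTJ vs. WRKLTJ scores 5/6 although the correct string has only 5 letters. As far as I know, the Levenshtein similarity from stringsim() is 1 - distance / length of the longer string, which explains the 6 in the denominator. The sketch below is not part of the original answer. It uses base R's adist() (package utils) instead of stringdist so that it runs without extra packages, and the names toy, lev, sim_max, acc_correct and per are made up for illustration. It compares that normalization with one possible alternative that divides by the length of the correct string. Which definition is appropriate remains a methodological choice.

```r
# Hypothetical toy data: four rows taken from the example in the question.
toy <- data.frame(
  MEM_Correct  = c("ZLHK", "WRKTJ", "SWVNDT", "BCJSP"),
  MEM_Response = c("ZLHK", "WRKLTJ", "SWVDTN", "BCJSB"),
  SERIAL       = c("4444", "4444", "4444", "5555"),
  stringsAsFactors = FALSE
)

# Row-wise Levenshtein distance; adist() returns a matrix, so take cell [1, 1].
lev <- mapply(function(a, b) adist(a, b)[1, 1],
              toy$MEM_Correct, toy$MEM_Response, USE.NAMES = FALSE)

# Similarity normalized by the longer of the two strings
# (assumed to match what stringsim(..., method = "lv") reports).
toy$sim_max <- 1 - lev / pmax(nchar(toy$MEM_Correct), nchar(toy$MEM_Response))

# Alternative (an assumption, not from the answer): normalize by the length of
# the correct string and floor at 0 so very long responses cannot go negative.
toy$acc_correct <- pmax(0, 1 - lev / nchar(toy$MEM_Correct))

# Mean accuracy per participant on a 0 to 1 scale, one row per SERIAL.
per <- aggregate(cbind(sim_max, acc_correct) ~ SERIAL, data = toy, FUN = mean)
print(per)
```

For WRKTJ vs. WRKLTJ the first measure gives 1 - 1/6 and the second gives 1 - 1/5, so the second penalizes an inserted letter slightly more. The resulting per data frame can be merged into the participant table by SERIAL in the same way as agg above.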