我试图对包含作者姓名的data.table执行近似字符串匹配,该字典基于“名字”字典。我还设置了高于0.9的较高阈值,以提高匹配质量。
但是,我收到以下错误消息:
Warning message:
In [`<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
即使我使用signif(similarity_score,4)将相似度匹配向下舍入到4位数字,也会发生此错误。
有关输入数据和方法的更多信息:
for (ijk in 1:nrow(author_corrected_df)){
max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
if (signif(max_sim1,4) >= 0.9720){
row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
} else {
next
}
}
执行时,我收到以下错误消息:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
在了解错误的出处以及是否有更快的方法执行这种匹配(尽管第二种是第二优先级)方面,将非常感谢您的帮助。
谢谢。
答案 0 :(得分:1)
在前面的评论之后,我在这里选择您所选择的性别最多的性别:
for (ijk in 1:nrow(author_corrected_df)){
max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
if (signif(max_sim1,4) >= 0.9720){
row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
# Analysis of factor gender
gender <- as.character( first_names_dict$gender[row_idx1] )
# I take the (first) gender most present in selection
df_count <- as.data.frame( table(gender) )
ref <- as.character ( df_count$test[which.max(df_count$Freq)] )
value <- unique ( test[which(test == ref)] )
# Affecting single character value to data frame
author_corrected_df$Gender_Dict[ijk] <- value
}
}
希望这会有所帮助:)