Question

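I am trying to perform approximate string matching on a data.table of author names against a dictionary of first names. I have also set a high threshold (above 0.9) to improve the quality of the matches.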

However, I get the following warning message:

Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).

This warning occurs even when I round the similarity score to 4 significant digits with signif(similarity_score, 4).

More information about the input data and the approach:

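author_corrected_df is a data.table containing the columns "Author" and "Author_Corrected". Author_Corrected is the alphabetic representation of the corresponding author (for example, if Author = Jack123, then Author_Corrected = Jack).
The Author_Corrected column can contain variants of a proper first name, for example Jackk instead of Jack. I want to fill in the corresponding gender in author_corrected_df, in a column called Gender_Dict.
Another data.table, first_names_dict, contains "name" (the first name) and gender (0 for female, 1 for male, 2 for ties).
For each row, I want to find the closest match for "Author_Corrected" among the "name" values in first_names_dict and fill in the corresponding gender (one of 0, 1, 2).
To make the string matching stricter, I use a threshold of 0.9720; values that do not match are later (in code not shown below) represented as NA.
first_names_dict and author_corrected_df can be accessed from the link below: https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7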

for (ijk in 1:nrow(author_corrected_df)){
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1,4) >= 0.9720){
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
  } else {
    next
  }
}

On execution, I get the following warning message:

Warning message:
In `[<-.data.table`(x, j = name, value = value) :
  Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).

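Any help in understanding where this warning comes from, and whether there is a faster way to perform this kind of matching (although the latter is a second priority), would be greatly appreciated.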

Thanks.

Answer 1

Following the earlier comments, here I pick the gender that occurs most often among the selected matches:

for (ijk in 1:nrow(author_corrected_df)){
        max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
        if (signif(max_sim1,4) >= 0.9720){
                row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))

                # Analysis of factor gender
                gender <- as.character( first_names_dict$gender[row_idx1] )

                # Take the (first) most frequent gender in the selection
                df_count <- as.data.frame( table(gender) )
                ref <- as.character( df_count$gender[which.max(df_count$Freq)] )
                value <- unique( gender[which(gender == ref)] )

                # Assign the single character value to the data.table
                author_corrected_df$Gender_Dict[ijk] <- value
        }
}
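
A note on the warning itself: the numbers in it are consistent with Gender_Dict not existing before the loop starts, although the question does not confirm this. In that case, author_corrected_df$Gender_Dict[ijk] <- value builds a vector of length ijk (6, if the first match is in row 6) and data.table recycles it over all 17789 rows; 17789 %% 6 is 5, which would explain the "remainder of 5 items". Below is a minimal sketch under that assumption, using made-up toy data in place of the real tables. It creates the column first and computes the similarities only once per row:

```r
library(data.table)
library(stringdist)

# Toy stand-ins for the real tables (hypothetical data, not from the question)
first_names_dict <- data.table(name   = c("jack", "anna", "jonathan"),
                               gender = c(1L, 0L, 1L))
author_corrected_df <- data.table(
  Author_Corrected = c("anna", "jonathann", "qwerty")
)

# Create the column up front so each assignment targets one existing cell
author_corrected_df[, Gender_Dict := NA_integer_]

for (ijk in seq_len(nrow(author_corrected_df))) {
  # Compute the Jaro-Winkler similarities once and reuse them
  sims <- stringsim(author_corrected_df$Author_Corrected[ijk],
                    first_names_dict$name, method = "jw", p = 0.1)
  best <- which.max(sims)
  if (length(best) == 1L && signif(sims[best], 4) >= 0.9720) {
    # set() updates a single cell by reference, without recycling
    set(author_corrected_df, i = ijk, j = "Gender_Dict",
        value = first_names_dict$gender[best])
  }
}
```

Rows without a close enough match simply keep the initial NA. The column is typed as integer here because the toy gender values are integers; with a factor or character gender column the initial NA would need the matching type.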

Hope this helps :)
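
On the secondary question about speed, one option that may be worth trying is stringdist's amatch(), which does the closest-match lookup for all rows in one call. The sketch below is only an illustration on made-up toy data. It assumes that a similarity of at least 0.9720 can be expressed as a Jaro-Winkler distance of at most 1 - 0.9720, so scores right at the boundary may be treated slightly differently from the signif() comparison in the loop:

```r
library(data.table)
library(stringdist)

# Toy stand-ins for the real tables (hypothetical data, not from the question)
first_names_dict <- data.table(name   = c("jack", "anna", "jonathan"),
                               gender = c(1L, 0L, 1L))
author_corrected_df <- data.table(
  Author_Corrected = c("anna", "jonathann", "qwerty")
)

# For each author, index of the closest dictionary name within maxDist,
# or NA when no name is close enough
idx <- amatch(author_corrected_df$Author_Corrected, first_names_dict$name,
              method = "jw", p = 0.1, maxDist = 1 - 0.9720)

# Indexing with NA yields NA, so unmatched rows get NA as their gender
author_corrected_df[, Gender_Dict := first_names_dict$gender[idx]]
```

When several names are equally close, amatch() returns the first one, which is the same tie behaviour as which.max() in the loop.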
