Question

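I have a data.table in which one column lists the harmonized tariff codes of the goods being shipped. There are some data-entry problems: sometimes a row repeats the same number, "7601.00; 7601.00", and sometimes it contains different numbers, "7601.00; 8800.00". I have not yet decided what to do when the entries differ, but the first thing I want to do is get rid of the duplicates. So I wrote a custom user-defined function: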

unique_hscodes <- function(hs_input){
  # Note: str_replace_all() comes from the stringr package
  new <- strsplit(hs_input, split = ";")                    # Delimiter ;
  new <- lapply(new, str_replace_all, " ", "")              # Remove spaces

  if (length(unique(unlist(new))) == 1) {                   # Unique HS code
    return(unique(unlist(new)))
  } else {
    new <- names(sort(table(unlist(new)), decreasing = TRUE)[1]) # Most frequent
    return(new)
  }
}

When I run DT[, hs_code := unique_hscodes(hscode)], I get back a data.table whose hs_code column contains the same number in every row. But when I run DT[, hs_code := unique_hscodes(hscode), by = 1:nrow(DT)], it works as expected.

Can someone explain what is happening here?

Answer 1

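Your code returns multiple items from a single-item input after the string split. Without by, the function receives the whole hscode column at once, so unlist() pools the codes from every row and the single most frequent code is recycled into all rows. When you run it with by = 1:nrow(DT), only one row is examined at a time, and the problem does not arise when only a single row is presented.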

 DT <- data.table(hscode=c("7601.00; 7601.00" , "7601.00; 8800.00"))
 DT
#-----
             hscode
1: 7601.00; 7601.00
2: 7601.00; 8800.00
#--
 DT[ ,  table( unlist( strsplit(hscode, split="; "))) ]

#7601.00 8800.00 
#      3       1 
 DT[ ,  table( unlist( strsplit(hscode, split="; "))) , by=1:nrow(DT)]
#---------
  nrow V1
1:    1  2
2:    2  1
3:    2  1

I tried @Jaap's code with the simple example, but it only split the column in two:

> DT[, hs_code := sapply(hscode, unique_hscodes)]
> DT
             hscode hs_code
1: 7601.00; 7601.00 7601.00
2: 7601.00; 8800.00 7601.00
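
For illustration, here is a minimal sketch of a variant that handles each element separately, so it returns one code per row without needing by. The name unique_hscodes_vec is hypothetical (it does not appear in the question), and base R's gsub() is swapped in for stringr's str_replace_all(); it assumes hscode is a character column.

```r
# Hypothetical vectorised variant: one result per input element.
unique_hscodes_vec <- function(hs_input) {
  # strsplit() returns a list with one character vector per input element
  parts <- strsplit(hs_input, split = ";", fixed = TRUE)

  # Process each element's pieces on their own instead of pooling them
  vapply(parts, function(p) {
    p <- gsub(" ", "", p, fixed = TRUE)          # remove spaces
    names(sort(table(p), decreasing = TRUE))[1]  # most frequent code in this element
  }, character(1))
}

unique_hscodes_vec(c("7601.00; 7601.00", "8800.00; 7601.00; 8800.00"))
```

With a function shaped like this, DT[, hs_code := unique_hscodes_vec(hscode)] should fill hs_code row by row. When a row holds different codes with equal counts, which one is returned depends on how sort() orders the tie, so that case still needs an explicit decision.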
