I have a data frame with about 8 million rows, like this:
Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford
Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford
.
.
.
.
Trevor Brown Brandon Crawford Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford
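Many rows contain repeated names, which I would like to remove. Splitting each row into a vector and then checking it for duplicates is hard to do, because it takes forever on a data frame with 8 million rows. Is there a faster way to accomplish this?
Answer 0 (score: 0)
From what I could gather from the question and the comments, I came up with this solution.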
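library(gtools)
# Eight distinct labels standing in for the names
a <- LETTERS[1:8]
# All orderings of the 8 labels, one per row, without repetition
data <- permutations(n = 8, r = 8, v = a)
tail(data)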
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [40315,] "H" "G" "F" "E" "D" "A" "B" "C"
# [40316,] "H" "G" "F" "E" "D" "A" "C" "B"
# [40317,] "H" "G" "F" "E" "D" "B" "A" "C"
# [40318,] "H" "G" "F" "E" "D" "B" "C" "A"
# [40319,] "H" "G" "F" "E" "D" "C" "A" "B"
# [40320,] "H" "G" "F" "E" "D" "C" "B" "A"
Does this solve the problem? (It creates 8! = 40320 permutations, and no letter appears twice in any row.)
Answer 1 (score: 0)
# Sample data: one string of space-separated names per row
df <- data.frame(names = c("Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford",
                           "Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford"),
                 stringsAsFactors = FALSE)
# Split each row on spaces, drop repeated words, and join the rest back together
df$unique_names <- " "
for (i in seq_len(nrow(df))) {
  df$unique_names[i] <- paste0(unique(unlist(strsplit(df$names[i], " "))), collapse = " ")
}
df$unique_names
# [1] "Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford"
# [2] "Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford"
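Since the question is about speed on 8 million rows, the same idea can be sketched without the row-by-row loop: `strsplit` is already vectorized over the whole column, so only the `unique` step has to run per row. The helper names `dedupe_words` and `dedupe_pairs` below are made up for illustration and are not part of the original answer. Note that deduplicating single words would also drop a surname shared by two different people, so the second helper assumes that every name is exactly two words (first name, last name) and compares whole names instead. No timings are claimed here; this is a sketch to benchmark on your own data.

```r
# Sample data in the same shape as the question (one string of names per row)
df <- data.frame(
  names = c(
    "Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford",
    "Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop repeated words, same logic as the loop above.
dedupe_words <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)  # one split call for the whole column
  vapply(parts, function(p) paste(unique(p), collapse = " "),
         character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: drop repeated full names.
# Assumption: every name consists of exactly two words.
dedupe_pairs <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)
  vapply(parts, function(p) {
    # Glue odd-positioned words (first names) to even-positioned ones (last names)
    full <- paste(p[c(TRUE, FALSE)], p[c(FALSE, TRUE)])
    paste(unique(full), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

df$unique_names <- dedupe_words(df$names)

# The two helpers differ when two people share a surname:
dedupe_words("Chris Brown Trevor Brown Chris Brown")  # "Chris Brown Trevor"
dedupe_pairs("Chris Brown Trevor Brown Chris Brown")  # "Chris Brown Trevor Brown"
```

On the sample rows both helpers give the same result as the loop, because no word is shared between two different names there.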