I have this data frame df:
df <- data.frame(stringsAsFactors=FALSE,
id = c(1L, 2L, 3L, 4L, 5L, 6L),
Country = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
Year = c(1965L, 1965L, 1965L, 1965L, 1965L, 1965L),
Time.step = c("Month", "Month", "Month", "Month", "Month", "Month"),
GSA.numb = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
Species = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
Quantity = c(500L, 200L, 200L, 350L, 350L, 125L)
)
df
id Country Year Time.step GSA.numb Species Quantity
1 ESP 1965 Month GSA 5 Mullus 500
2 ESP 1965 Month GSA 5 Mullus 200
3 ESP 1965 Month GSA 5 Mullus 200
4 ITA 1965 Month GSA 17 Eledone 350
5 ITA 1965 Month GSA 17 Eledone 350
6 ITA 1965 Month GSA 17 Eledone 125
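Some rows are duplicates, for example rows 3 and 5. I can create a column of logical TRUE/FALSE values that flags a row when it is a duplicate: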
df$dup <- duplicated(df[,2:7]) #No id!
Result:
id Country Year Time.step GSA.numb Species Quantity dup
1 ESP 1965 Month GSA 5 Mullus 500 FALSE
2 ESP 1965 Month GSA 5 Mullus 200 FALSE
3 ESP 1965 Month GSA 5 Mullus 200 TRUE
4 ITA 1965 Month GSA 17 Eledone 350 FALSE
5 ITA 1965 Month GSA 17 Eledone 350 TRUE
6 ITA 1965 Month GSA 17 Eledone 125 FALSE
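Now I want a new column (built dynamically, because my real df is very large, with many rows, columns and variables) that shows, when dup is TRUE, the id of the row being duplicated, like this: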
expected.df
id Country Year Time.step GSA.numb Species Quantity dup ref
1 ESP 1965 Month GSA 5 Mullus 500 FALSE NA
2 ESP 1965 Month GSA 5 Mullus 200 FALSE NA
3 ESP 1965 Month GSA 5 Mullus 200 TRUE =id2
4 ITA 1965 Month GSA 17 Eledone 350 FALSE NA
5 ITA 1965 Month GSA 17 Eledone 350 TRUE =id4
6 ITA 1965 Month GSA 17 Eledone 125 FALSE NA
I tried:
with(df, ave(as.character(Species), df[,2:6], FUN = make.unique))
But the result is:
[1] "Mullus" "Mullus.1" "Mullus.2" "Eledone" "Eledone.1" "Eledone.2"
I think I need more code than this. Which functions would be useful? (duplicated, make.unique, row.names, etc.)
Answer 0 (score: 4)
A data.table approach, starting from the initial data frame:
library(data.table)
setDT(df)[, `:=` (dup = seq_len(.N) > 1, ref = paste0("id", first(id))),
by = .(Country, Year, Time.step, GSA.numb, Species, Quantity)][dup == FALSE, ref := NA]
Output:
id Country Year Time.step GSA.numb Species Quantity dup ref
1: 1 ESP 1965 Month GSA5 Mullus 500 FALSE <NA>
2: 2 ESP 1965 Month GSA5 Mullus 200 FALSE <NA>
3: 3 ESP 1965 Month GSA5 Mullus 200 TRUE id2
4: 4 ITA 1965 Month GSA17 Eledone 350 FALSE <NA>
5: 5 ITA 1965 Month GSA17 Eledone 350 TRUE id4
6: 6 ITA 1965 Month GSA17 Eledone 125 FALSE <NA>
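Since the real data has many columns, the grouping columns can also be passed by name rather than typed out. The following is only a minimal sketch of that variation, assuming that every column except id defines a duplicate; key_cols and dt are illustrative names, and the "=id" prefix is added to match the expected output in the question.

```r
library(data.table)

# Data frame from the question
df <- data.frame(stringsAsFactors = FALSE,
  id = c(1L, 2L, 3L, 4L, 5L, 6L),
  Country = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year = c(1965L, 1965L, 1965L, 1965L, 1965L, 1965L),
  Time.step = c("Month", "Month", "Month", "Month", "Month", "Month"),
  GSA.numb = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity = c(500L, 200L, 200L, 350L, 350L, 125L)
)

dt <- as.data.table(df)

# Assumption: all columns other than id are used to decide what a duplicate is
key_cols <- setdiff(names(dt), "id")

# Within each group of identical rows, every row after the first is a duplicate
dt[, dup := seq_len(.N) > 1L, by = key_cols]

# Duplicates point at the id of the first row in their group; other rows get NA
dt[, ref := ifelse(dup, paste0("=id", id[1L]), NA_character_), by = key_cols]

dt
```

Because key_cols is computed from the column names, the same code should carry over to a wider table without listing each column by hand.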
A tidyverse approach (with dup already created beforehand):
library(tidyverse)
df %>%
group_by_at(vars(2:7)) %>%
mutate(ref = ifelse(dup, paste0("id", first(id)), NA_character_))
Output:
id Country Year Time.step GSA.numb Species Quantity dup ref
<int> <chr> <int> <chr> <chr> <chr> <int> <lgl> <chr>
1 1 ESP 1965 Month GSA5 Mullus 500 FALSE NA
2 2 ESP 1965 Month GSA5 Mullus 200 FALSE NA
3 3 ESP 1965 Month GSA5 Mullus 200 TRUE id2
4 4 ITA 1965 Month GSA17 Eledone 350 FALSE NA
5 5 ITA 1965 Month GSA17 Eledone 350 TRUE id4
6 6 ITA 1965 Month GSA17 Eledone 125 FALSE NA
If you want to create the dup column within the same statement:
df %>%
group_by_at(vars(2:7)) %>%
mutate(
dup = row_number() > 1,
ref = ifelse(dup, paste0("id", first(id)), NA_character_))
Output:
id Country Year Time.step GSA.numb Species Quantity dup ref
<int> <chr> <int> <chr> <chr> <chr> <int> <lgl> <chr>
1 1 ESP 1965 Month GSA5 Mullus 500 FALSE NA
2 2 ESP 1965 Month GSA5 Mullus 200 FALSE NA
3 3 ESP 1965 Month GSA5 Mullus 200 TRUE id2
4 4 ITA 1965 Month GSA17 Eledone 350 FALSE NA
5 5 ITA 1965 Month GSA17 Eledone 350 TRUE id4
6 6 ITA 1965 Month GSA17 Eledone 125 FALSE NA
Answer 1 (score: 2)
You can use tidyverse functions to quickly identify the duplicates.
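Answer 2 (score: 0)
This example uses base R and matches each duplicate it finds back to the original row. This is also helpful when a single row has more than one duplicate.
Example data (generated with dput(control = NULL), so characters/factors were converted to numbers):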
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
Country = c(1965, 1965, 1965, 1965, 1965, 1965),
Year = c(1, 1, 1, 1, 1, 1),
Time.step = c(1, 1, 1, 1, 1, 1),
GSA.numb = c(5, 5, 5, 17, 17, 17),
Species = c(2, 2, 2, 1, 1, 1), Quantity = c(500, 200, 200, 350, 350, 125))
The comparisons are vectorized, so even with the outer loop it should still run reasonably fast on a large data frame.
df$dup <- duplicated(df)
dupes <- df[df$dup,]
df$ref <- NA # initialize
for(i in seq_len(nrow(dupes))){ # seq_len() skips the loop when there are no duplicates
z=which(df[,1] == dupes[i,1]&
df[,2] == dupes[i,2]&
df[,3] == dupes[i,3]&
df[,4] == dupes[i,4]&
df[,5] == dupes[i,5]&
df[,6] == dupes[i,6]&
df[,7] == dupes[i,7]) # make sure not to include that $dup column!
df$ref[z[-1]] <- paste0("=id",min(z))
}
df
# id Country Year Time.step GSA.numb Species Quantity dup ref
#1 1 1965 1 1 5 2 500 FALSE <NA>
#2 1 1965 1 1 5 2 200 FALSE <NA>
#3 1 1965 1 1 5 2 200 TRUE =id2
#4 2 1965 1 1 17 1 350 FALSE <NA>
#5 2 1965 1 1 17 1 350 TRUE =id4
#6 2 1965 1 1 17 1 125 FALSE <NA>
This could probably be tightened up with an apply function, which might also run faster.
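As one possible loop-free variation on the idea above, the rows can be collapsed into a single key string and each key matched against its first occurrence with match(). This is only a sketch on the data frame from the question, assuming that every column except id defines a duplicate; key_cols, key and first_row are illustrative names. Pasting values into strings is a simplification that may not be safe for floating-point columns.

```r
# Data frame from the question
df <- data.frame(stringsAsFactors = FALSE,
  id = c(1L, 2L, 3L, 4L, 5L, 6L),
  Country = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year = c(1965L, 1965L, 1965L, 1965L, 1965L, 1965L),
  Time.step = c("Month", "Month", "Month", "Month", "Month", "Month"),
  GSA.numb = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity = c(500L, 200L, 200L, 350L, 350L, 125L)
)

# Assumption: all columns other than id are used to decide what a duplicate is
key_cols <- setdiff(names(df), "id")

# One string per row; "\r" is a separator unlikely to occur in the data
key <- do.call(paste, c(df[key_cols], sep = "\r"))

# match(key, key) gives, for each row, the index of the first row with that key
first_row <- match(key, key)

df$dup <- duplicated(key)
df$ref <- ifelse(df$dup, paste0("=id", df$id[first_row]), NA_character_)

df
```

If a row has several duplicates, all of them point back to the same first occurrence, because match() always returns the first matching position.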
Answer 3 (score: 0)
Using tidyverse:
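The code for this answer is not shown here. As a rough sketch of what a tidyverse solution could look like (not the answerer's actual code), the grouping columns can be selected by name with across() and all_of(), which assumes dplyr 1.0.0 or later. key_cols and out are illustrative names, and the sketch assumes that every column except id defines a duplicate.

```r
library(dplyr)

# Data frame from the question
df <- data.frame(stringsAsFactors = FALSE,
  id = c(1L, 2L, 3L, 4L, 5L, 6L),
  Country = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year = c(1965L, 1965L, 1965L, 1965L, 1965L, 1965L),
  Time.step = c("Month", "Month", "Month", "Month", "Month", "Month"),
  GSA.numb = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity = c(500L, 200L, 200L, 350L, 350L, 125L)
)

# Assumption: all columns other than id are used to decide what a duplicate is
key_cols <- setdiff(names(df), "id")

out <- df %>%
  group_by(across(all_of(key_cols))) %>%
  mutate(
    # every row after the first in a group of identical rows is a duplicate
    dup = row_number() > 1,
    # duplicates reference the id of the first row in their group
    ref = if_else(dup, paste0("=id", first(id)), NA_character_)
  ) %>%
  ungroup()

out
```

Selecting the columns by name rather than by position keeps the grouping stable if columns are later added or reordered.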