My data looks like this:
df<- structure(list(label = c("afghanestan", "afghanestan", "afghanestanIndia",
"afghanestanindiaholad", "afghanestanUSA", "USA", "Argentina",
"Brazil", "Argentinabrazil", "Brazil"), Start = c(114, 516, 89,
22, 33, 67, 288, 362, 45, 362), Stop = c(127, 544, 105, 34, 50,
85, 299, 381, 68, 381)), class = "data.frame", .Names = c("label",
"Start", "Stop"), row.names = c(NA, -10L))
When I want to remove rows that are exactly identical, I simply do this:
df[!duplicated(df[,c('label','Start','Stop')]),]
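Now the problem is that I want to identify the rows that share the same label but may differ in Start and Stop. So afterwards I would like to end up with something like this: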
label Start Stop NewLab
1 afghanestan 114 127 TRUE
2 afghanestan 516 544 TRUE
3 afghanestanIndia 89 105 FALSE
4 afghanestanindiaholad 22 34 FALSE
5 afghanestanUSA 33 50 FALSE
6 USA 67 85 FALSE
7 Argentina 288 299 FALSE
8 Brazil 362 381 FALSE
9 Argentinabrazil 45 68 FALSE
Answer 0: (score: 1)
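Assuming the exact duplicates have already been removed as in the question, this can be done in one line of code: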
df$NewLab <- df$label %in% df[duplicated(df$label), ]$label
Output:
> df$NewLab <- df$label %in% df[duplicated(df$label), ]$label
> df
label Start Stop NewLab
1 afghanestan 114 127 TRUE
2 afghanestan 516 544 TRUE
3 afghanestanIndia 89 105 FALSE
4 afghanestanindiaholad 22 34 FALSE
5 afghanestanUSA 33 50 FALSE
6 USA 67 85 FALSE
7 Argentina 288 299 FALSE
8 Brazil 362 381 FALSE
9 Argentinabrazil 45 68 FALSE
Or, in dplyr notation:
df <- dplyr::mutate(df, NewLab = label %in% df[duplicated(df$label), ]$label)
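The one-liner above can be checked end to end with a small base R sketch. It assumes the exact duplicates are dropped first, as in the question; the name `dedup` is only illustrative, and the `fromLast = TRUE` form is an equivalent alternative to the `%in%` lookup, not part of the original answer.

```r
# Sample data from the question.
df <- data.frame(
  label = c("afghanestan", "afghanestan", "afghanestanIndia",
            "afghanestanindiaholad", "afghanestanUSA", "USA", "Argentina",
            "Brazil", "Argentinabrazil", "Brazil"),
  Start = c(114, 516, 89, 22, 33, 67, 288, 362, 45, 362),
  Stop  = c(127, 544, 105, 34, 50, 85, 299, 381, 68, 381),
  stringsAsFactors = FALSE
)

# Step 1: drop rows that are identical in all three columns (row 10 here).
dedup <- df[!duplicated(df[, c("label", "Start", "Stop")]), ]

# Step 2: flag every row whose label occurs more than once.
# duplicated() alone marks only the second and later occurrences, so it is
# combined with a scan from the end to catch the first occurrence as well.
dedup$NewLab <- duplicated(dedup$label) | duplicated(dedup$label, fromLast = TRUE)

print(dedup)
```

Skipping step 1 would also flag the two identical Brazil rows, since their labels match even though Start and Stop do not differ.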
Answer 1: (score: 0)
Here is a solution using dplyr:
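library(tidyverse)
df %>%
  group_by(label) %>%
  mutate(n = n()) %>%    # number of rows sharing this label
  group_by(Start, Stop) %>%
  mutate(n2 = n()) %>%   # number of rows sharing this Start/Stop pair
  ungroup() %>%          # drop the grouping so the result is a plain tibble
  mutate(newlabel = ifelse(n > 1 & n2 == 1, TRUE, FALSE)) %>%
  dplyr::select(-n, -n2)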
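First group by label and count the rows (n), then group by Start and Stop and count again (n2). ifelse assigns TRUE when the label occurs more than once and the Start/Stop pair is unique, and FALSE otherwise. Finally, drop the intermediate columns.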
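For readers without dplyr, the same two counts can be sketched in base R with `ave()`. This is an illustrative equivalent under the assumption that the original 10-row `df` is used without removing duplicates first; the helper names `n_label` and `n_range` are hypothetical and do not appear in the answer.

```r
# Sample data from the question, including the duplicated Brazil row.
df <- data.frame(
  label = c("afghanestan", "afghanestan", "afghanestanIndia",
            "afghanestanindiaholad", "afghanestanUSA", "USA", "Argentina",
            "Brazil", "Argentinabrazil", "Brazil"),
  Start = c(114, 516, 89, 22, 33, 67, 288, 362, 45, 362),
  Stop  = c(127, 544, 105, 34, 50, 85, 299, 381, 68, 381),
  stringsAsFactors = FALSE
)

rows <- seq_len(nrow(df))

# Number of rows sharing each label (the answer's n).
n_label <- ave(rows, df$label, FUN = length)

# Number of rows sharing each Start/Stop pair (the answer's n2).
# The pair is pasted into a single key so ave() groups on one vector.
n_range <- ave(rows, paste(df$Start, df$Stop), FUN = length)

# TRUE only when the label repeats and the Start/Stop pair is unique.
df$newlabel <- n_label > 1 & n_range == 1

print(df)
```

Because the two Brazil rows share the same Start/Stop pair, they get a count of 2 for that pair and stay FALSE, so this variant does not need the exact duplicates removed beforehand.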