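I have a data frame as shown below. For Name "A" the value 45 is repeated more than 3 times, and the same is true of 67 for "B". I need New_Value to be blank/NA for the repeated ("frozen") occurrences whenever a value is repeated more than 3 times.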
Name  Value  New_Value
A     24     24
A     45     45
A     45
A     45
A     45
A     45
A     93     93
A     19     19
A     10     10
B     29     29
B     67     67
B     67
B     67
B     67
C     201    201
C     993    993
C     396    396
Answer 0 (score: 2)
FWIW, here is another solution that uses data.table with rleid() rather than duplicated().
Note that the OP asked for the following repeated values to be blanked only if a value appears more than 3 times. This means that no blank should appear in the result for a Value that is repeated only twice. I amended my sample data set to include cases where the same value is repeated twice.
Edit: The OP has not made clear whether he is counting repetitions of the same Value in a given sequence regardless of Name, or whether he is counting repetitions within each Name group. See also this comment.
In addition, the OP has not specified the result he expects when there is a run of repeated Value while Name changes.
Therefore, I amended my sample data set to include these additional use cases:
DT
# Name Value
# 1: A 24
# 2: A 24
# 3: A 45
# 4: A 45
# 5: A 45
# 6: A 45
# 7: A 45
# 8: A 93
# 9: A 19
#10: A 19
#11: A 10
#12: B 29
#13: B 67
#14: B 67
#15: B 67
#16: B 67
#17: C 201
#18: C 993
#19: C 396
#20: A 19
#21: A 19
#22: C 19
#23: B 29
#24: B 67
#25: B 67
#26: B 67
#27: B 67
#28: C 67
#29: C 67
#30: C 67
#31: C 67
# Name Value
As in the other answers, NA is used for blank.
library(data.table)
setDT(DT)[, New := Value[.N < 3], by=rleid(Value)][rowid(rleid(Value)) == 1L, New := Value]
DT
# Name Value New
# 1: A 24 24
# 2: A 24 24
# 3: A 45 45
# 4: A 45 NA
# 5: A 45 NA
# 6: A 45 NA
# 7: A 45 NA
# 8: A 93 93
# 9: A 19 19
#10: A 19 19
#11: A 10 10
#12: B 29 29
#13: B 67 67
#14: B 67 NA
#15: B 67 NA
#16: B 67 NA
#17: C 201 201
#18: C 993 993
#19: C 396 396
#20: A 19 19
#21: A 19 NA
#22: C 19 NA
#23: B 29 29
#24: B 67 67
#25: B 67 NA
#26: B 67 NA
#27: B 67 NA
#28: C 67 NA
#29: C 67 NA
#30: C 67 NA
#31: C 67 NA
# Name Value New
The first expression copies Value for all RLE groups that consist of one or two rows. All RLE groups with more repetitions get NA. The second expression then copies Value only for the first row of each RLE group.
Note that each run of repeated values is treated separately, regardless of Name: the changes of Name from A to C in row 22 and from B to C in row 28 are ignored.
This can be improved further so that Value is copied only where it has not been copied already:
setDT(DT)[, New := Value[.N < 3], by=rleid(Value)
][is.na(New) & rowid(rleid(Value)) == 1L, New := Value]
If a change of Name is expected to "restart" the count of repetitions, this variant can be used instead (credit to Jaap):
setDT(DT)[, New := Value[.N < 3], by = rleid(Name, Value)
][is.na(New) & rowid(rleid(Name, Value)) == 1L, New := Value][]
# Name Value New
# 1: A 24 24
# 2: A 24 24
# 3: A 45 45
# 4: A 45 NA
# 5: A 45 NA
# ...
#18: C 993 993
#19: C 396 396
#20: A 19 19
#21: A 19 19
#22: C 19 19
#23: B 29 29
#24: B 67 67
#25: B 67 NA
#26: B 67 NA
#27: B 67 NA
#28: C 67 67
#29: C 67 NA
#30: C 67 NA
#31: C 67 NA
# Name Value New
Note the differences in rows 21, 22, and 28.
Data:
DT <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "A", "A",
"C", "B", "B", "B", "B", "B", "C", "C", "C", "C"), Value = c(24L,
24L, 45L, 45L, 45L, 45L, 45L, 93L, 19L, 19L, 10L, 29L, 67L, 67L,
67L, 67L, 201L, 993L, 396L, 19L, 19L, 19L, 29L, 67L, 67L, 67L,
67L, 67L, 67L, 67L, 67L)), .Names = c("Name", "Value"), row.names = c(NA,
-31L), class = "data.frame")
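Note that, compared with the OP's data set, rows 1 and 8 have been duplicated to cover the case of two repetitions, and a few rows have been added at the end.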
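The run-based logic of this answer can also be sketched in base R, without data.table. This is only a minimal sketch, not the answer's method: blank_long_runs and its key and max_keep arguments are hypothetical names introduced here, and max_keep = 2 is chosen to mirror the .N < 3 condition used above. The sample vectors correspond to rows 20 to 22 and 24 to 31 of DT.

```r
# Hypothetical helper: blank all but the first element of every run
# that is longer than `max_keep`. `key` defines which consecutive rows
# count as "the same"; by default only the value itself is compared.
blank_long_runs <- function(value, key = value, max_keep = 2L) {
  runs <- rle(as.character(key))
  run_len <- rep(runs$lengths, runs$lengths)  # length of the run each row belongs to
  run_pos <- sequence(runs$lengths)           # position of each row inside its run
  ifelse(run_len <= max_keep | run_pos == 1L, value, NA)
}

name  <- c("A", "A", "C", "B", "B", "B", "B", "C", "C", "C", "C")
value <- c(19L, 19L, 19L, 67L, 67L, 67L, 67L, 67L, 67L, 67L, 67L)

# Runs defined by Value only: a change of Name does not start a new run.
by_value <- blank_long_runs(value)
# 19 NA NA 67 NA NA NA NA NA NA NA

# Runs defined by Name and Value together: a change of Name restarts the run.
by_name_value <- blank_long_runs(value, key = paste(name, value))
# 19 19 19 67 NA NA NA 67 NA NA NA
```

Passing the key separately from the value keeps both interpretations of the question in one function, so the only thing that changes between them is what counts as a repetition.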
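Answer 1 (score: 0)
Here is a data.table approach. I cover both duplicates within each Names group and duplicates across the whole table. Here is the code: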
library(data.table)
# sample() is called directly so that this block does not need the %>% pipe
dt <- data.table(Names = sample(LETTERS[1:5], 100, replace = TRUE),
Value = sample(1:10, 100, replace = TRUE))
dt <- dt[order(Names, Value)]
# if you look for in-group duplicates
dt[, count := .N, by = .(Names, Value)][, New_Value := Value]
dt[ , dup_ingroup := duplicated(Value), by = Names]
dt[dup_ingroup & count > 3, New_Value := NA]
# if you look for all duplicates
dt[, count := .N, by = Value][, New_Value := Value]
dt[duplicated(Value) & count > 3, New_Value := NA]
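To make the difference between the two variants concrete, here is a small base R sketch on a hand-made data set. It is an illustration under assumptions, not the answer's code: the vectors names_ and value and all result names are made up for this example, and ave() and duplicated() stand in for the grouped data.table operations.

```r
names_ <- c("A", "A", "A", "A", "B", "B", "B", "B")
value  <- c(5L, 5L, 5L, 5L, 5L, 5L, 7L, 7L)

# Variant 1: count and flag duplicates within each (Names, Value) combination.
count_in_group <- ave(seq_along(value), names_, value, FUN = length)
dup_in_group   <- duplicated(data.frame(names_, value))
new_in_group   <- ifelse(dup_in_group & count_in_group > 3, NA, value)
# 5 NA NA NA 5 5 7 7

# Variant 2: count and flag duplicates of Value across the whole table.
count_all <- ave(seq_along(value), value, FUN = length)
dup_all   <- duplicated(value)
new_all   <- ifelse(dup_all & count_all > 3, NA, value)
# 5 NA NA NA NA NA 7 7
```

The two 5s under "B" survive in the first variant because that group contains the value only twice, while the second variant blanks them because 5 occurs six times overall.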
Note (following the comments):
library(data.table)
library(dplyr)
set.seed(20170515)
dt <- data.table(Names = LETTERS[1:5] %>% sample(100, replace = TRUE),
Value = sample(1:10, 100, replace = TRUE))
dt <- dt[order(Names, Value)]
dt_1 <- copy(dt)
dt_2 <- copy(dt)
dt_Jaap <- copy(dt)
# Method 1
dt_1[, count := .N, by = .(Names, Value)][, New_Value := Value]
dt_1[ , dup_ingroup := duplicated(Value), by = Names]
dt_1[dup_ingroup & count > 3, New_Value := NA]
dt_1[, .N, by = is.na(New_Value)]
## is.na N
## 1: FALSE 73
## 2: TRUE 27
# Method 2
dt_2[, count := .N, by = Value][, New_Value := Value]
dt_2[duplicated(Value) & count > 3, New_Value := NA]
dt_2[, .N, by = is.na(New_Value)]
## is.na N
## 1: FALSE 12
## 2: TRUE 88
# Method suggested by @Jaap
dt_Jaap[, New_Value := Value][duplicated(Value) & .N > 3, New_Value := NA_integer_, by = .(Names, Value)]
dt_Jaap[, .N, by = is.na(New_Value)]
## is.na N
## 1: FALSE 10
## 2: TRUE 90
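dt_Jaap keeps only the value of the first element of each Value, because duplicated(Value) in i is evaluated on the whole table before by is applied.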
Answer 2 (score: -1)
And a dplyr / tidyverse way, assuming the order of the data frame does not matter...
df <- data.frame(Name = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","C","C","C","C","C"),
Value = c(24,45,45,45,45,45,93,19,10,29,67,67,67,67,201,993,396,396,396),
stringsAsFactors = F)
library(dplyr)
df %>%
group_by(Name, Value) %>%
mutate(New_Value = ifelse(n() > 3 & row_number() > 1, NA, Value))
Update
A more robust approach that can handle multiple runs of the same value...
df <- read.table(header = T, stringsAsFactors = F, text = "
Name Value
A 45
A 45
A 45
A 82
A 45
A 45
A 45
A 45
A 12
A 45
A 45
A 45
A 45
A 45
B 29
B 67
B 67
B 67
B 67
")
library(dplyr)
df %>%
group_by(Name) %>%
mutate(run_length = with(rle(Value), rep(lengths, lengths))) %>%
mutate(run_start = seq_along(Value) %in% cumsum(c(1, rle(Value)$lengths))) %>%
mutate(New_Value = ifelse(run_length < 4 | run_start, Value, NA)) %>%
ungroup() %>% select(-run_length, -run_start)
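The intermediate columns in the pipeline above may be easier to follow on a single vector. The following base R sketch walks through the same steps for the first nine A values of the example; the variable names are chosen here for illustration, and head(..., -1) is an addition that drops the trailing index one past the end of the vector.

```r
value <- c(45, 45, 45, 82, 45, 45, 45, 45, 12)

# rle() splits the vector into runs of identical consecutive values.
runs <- rle(value)
runs$lengths  # 3 1 4 1
runs$values   # 45 82 45 12

# Length of the run that each element belongs to.
run_length <- rep(runs$lengths, runs$lengths)
# TRUE for the first element of every run (run starts are at 1, 4, 5, 9).
run_start <- seq_along(value) %in% head(cumsum(c(1, runs$lengths)), -1)

# Keep short runs entirely, and only the first element of runs of 4 or more.
new_value <- ifelse(run_length < 4 | run_start, value, NA)
# 45 45 45 82 45 NA NA NA 12
```

The first run of three 45s is kept in full, whereas the later run of four keeps only its first element, which is the behaviour the grouped dplyr version produces per Name.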