I need to detect sequences by group in a data.frame and compute a new variable for them.
Consider the following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D
I am interested in computing a measure for IDs whose rows follow this sequence:
- Count == 1
- Count > 1
- Count == 1
In the example, this applies to:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
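As a cross-check on the row numbers listed above, here is a minimal base R sketch that flags the last row of every `== 1`, `> 1`, `== 1` run within each `ID`. The helper name `ends_pattern` is made up for illustration and is not part of the question or of any package; only the `ID` and `count` columns are copied from `df1`.

```r
# Only the two columns needed for the pattern, copied from the question.
df <- data.frame(
  ID    = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1)
)

# Hypothetical helper: TRUE where a row closes a (== 1, > 1, == 1) run.
ends_pattern <- function(x) {
  n     <- length(x)
  prev1 <- c(NA, head(x, -1))[seq_len(n)]      # count one row earlier
  prev2 <- c(NA, NA, head(x, -2))[seq_len(n)]  # count two rows earlier
  hit   <- x == 1 & prev1 > 1 & prev2 == 1
  !is.na(hit) & hit                            # rows without two predecessors are FALSE
}

# ave() applies the helper within each ID; it returns 0/1, so convert back.
df$end_row <- as.logical(ave(df$count, df$ID, FUN = ends_pattern))

which(df$end_row)
# 4 10 14
```

Each flagged row is the third row of a sequence, so the sequences are rows 2 to 4, 8 to 10 and 12 to 14, which agrees with the list in the question.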
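For these IDs and rows, I need to compute a measure called `new`. It takes the value of `product` in the last row of the sequence if that value is in the second row and not in the `stock` of the first row of the sequence.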
The desired result looks like this:
> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3
3  3    2    3    4   D
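One possible reading of the rule for `new`, checked against the three sequences in the expected output, is sketched below in base R. It assumes that "in the second row" means the `product` of the second row and that comma-separated entries are compared item by item; both are assumptions, and `new_value` is a hypothetical helper, not something from the question.

```r
# Hypothetical helper: product and stock are the three rows of one sequence.
new_value <- function(product, stock) {
  last   <- strsplit(product[3], ",", fixed = TRUE)[[1]]  # items in the last row's product
  second <- strsplit(product[2], ",", fixed = TRUE)[[1]]  # items in the second row's product
  first  <- strsplit(stock[1],   ",", fixed = TRUE)[[1]]  # items in the first row's stock
  ok <- any(last %in% second) && !any(last %in% first)
  if (ok) product[3] else ""
}

# Rows 2-4 of ID 1
new_value(c("B", "C", "C"), c("A,B", "A,B,C", "A,B,C"))
# "C"

# Rows 8-10 of ID 2
new_value(c("A", "B", "A"), c("A", "A,B", "A,B"))
# ""

# Rows 12-14 of ID 3
new_value(c("A,B,C", "D", "D"), c("A,B,C", "A,B,C,D", "A,B,C,D"))
# "D"
```

Under this reading the helper reproduces the `new` column of the expected output. It is meant as a way to verify the logic on small inputs, not as an efficient solution for large data.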
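Note:
An ID can have more than one qualifying sequence, and all of them should be recorded. IDs can have up to 1500 sequences. How would you write efficient code to get this output?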
Answer 0 (score: 1)
Here is a data.table option:
library(data.table)

char_cols <- c("product", "stock")
setDT(df1)[,
           (char_cols) := lapply(.SD, as.character),
           .SDcols = char_cols] # in case they're factors

df1[, c1 := (count == 1) &
            (shift(count) > 1) &
            (shift(count, 2L) == 1),
    by = ID] # condition 1: counts follow ==1, >1, ==1 and this row is the last of the three

df1[, pat := paste0("(", gsub(",", "|", product), ")")] # regex pattern, e.g. "A,B" becomes "(A|B)"

df1[, c2 := mapply(grepl, pat, shift(product)) &
            !mapply(grepl, pat, shift(stock, 2L)),
    by = ID] # condition 2: matches the previous product, not the stock two rows back

df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
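The pattern-building step in the code above can be looked at in isolation. The following base R sketch uses made-up input vectors (they are illustrative, not rows of `df1`) to show what `gsub` produces and how `mapply(grepl, ...)` tests each pattern against the matching element of a second vector.

```r
# Illustrative values only.
product <- c("B", "C", "A,B")
target  <- c("A,B", "A,B,C", "C")

# Turn each comma-separated entry into a regex alternation.
pat <- paste0("(", gsub(",", "|", product), ")")
pat
# "(B)"   "(C)"   "(A|B)"

# Element-wise test: does pat[i] match target[i]?
mapply(grepl, pat, target, USE.NAMES = FALSE)
# TRUE  TRUE FALSE

# grepl() returns FALSE for NA input, which is why the NA values that
# shift() puts at the start of each group do not need special handling here.
grepl("(A)", NA)
# FALSE

# Caveat: this is substring matching, so a product named "A" would also
# match a different product named "AB".
grepl("(A)", "AB")
# TRUE
```

With the single-letter product names in the question the substring caveat does not matter, but it may be worth keeping in mind if real product names can contain one another.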
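Answer 1 (score: 1)
Here is another approach using the tidyverse; however, I think `lag` and `lead` make this solution somewhat time-consuming. I have included comments in the code to make it clearer.
Still, I spent enough time on it that I am posting it anyway.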
library(tidyverse)

df1 %>%
  group_by(ID) %>%
  # flag each row with count > 1 whose previous and next rows both have count == 1
  mutate(test = (count > 1 & c(FALSE, lag(count == 1)[-1]) & c(lead(count == 1)[-n()], FALSE))) %>%
  # flag all three rows of each chunk that meets the condition, to filter on later
  mutate(test2 = test | c(FALSE, lag(test)[-1]) | c(lead(test)[-n()], FALSE)) %>%
  filter(test2) %>%
  ungroup() %>%
  # group each three occurrences, in case an ID has multiple qualifying sequences
  group_by(G = trunc(3:(n() + 2) / 3)) %>%
  group_by(ID, G) %>%
  # create the new column with string extraction techniques
  # (assuming those columns are characters)
  mutate(new =
           str_remove_all(
             as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
             stock[1])) %>%
  # select the desired columns and add times for the long-to-wide conversion
  select(ID, G, seqs, new) %>%
  mutate(times = 1:n()) %>%
  ungroup() %>%
  # long-to-wide conversion using tidyr (part of tidyverse)
  gather(key, value, -ID, -G, -new, -times) %>%
  unite(col, key, times) %>%
  spread(col, value) %>%
  # put the columns in the desired order
  select(-G, -new, new) %>%
  as.data.frame()
#   ID seqs_1 seqs_2 seqs_3 new
# 1  1      2      3      4   C
# 2  2      1      2      3
# 3  3      2      3      4   D
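The grouping expression `trunc(3:(n() + 2) / 3)` in the pipeline above may look cryptic. The base R sketch below evaluates it for a hypothetical nine filtered rows (the variable `n` stands in for `n()`) to show that it simply numbers consecutive blocks of three rows.

```r
n <- 9  # hypothetical number of rows left after the filter step

# `:` binds tighter than `/`, so this is (3:(n + 2)) / 3, truncated.
g <- trunc(3:(n + 2) / 3)
g
# 1 1 1 2 2 2 3 3 3

# The same block ids written more directly.
g2 <- rep(seq_len(n / 3), each = 3)
all(g == g2)
# TRUE
```

This relies on the filtered rows coming in complete, non-overlapping runs of three, which holds for the example data.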