A new problem has surfaced in part of my final product.
My input file looks like this:
NAME; YEAR; ID; VALUE
Sample1; 1998; 354; 45
Sample1; 1999; 354; 23
Sample1; 2000; 354; 66
Sample1; 2001; 354; 98
Sample1; 2002; 354; 36
Sample1; 2003; 354; 59
Sample1; 2004; 354; 64
Sample1; 2005; 354; 23
Sample1; 2006; 354; 69
Sample1; 2007; 354; 94
Sample1; 2008; 354; 24
Sample2; 1964; 1342; 7
Sample2; 1965; 1342; 24
Sample3; 2002; 859; 90
Sample3; 2003; 859; 93
Sample3; 2004; 859; 53
Sample3; 2005; 859; 98
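What I want to do is add one row at the top of each sample group (one row for Sample1, one for Sample2, and so on). The new row should copy every value from the group's initial row, except that the VALUE field should be 0 and the YEAR field should be the previous year.
My final output, for roughly 80000 samples, should look like this: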
NAME; YEAR; ID; VALUE
Sample1; 1997; 354; 0
Sample1; 1998; 354; 45
Sample1; 1999; 354; 23
Sample1; 2000; 354; 66
Sample1; 2001; 354; 98
Sample1; 2002; 354; 36
Sample1; 2003; 354; 59
Sample1; 2004; 354; 64
Sample1; 2005; 354; 23
Sample1; 2006; 354; 69
Sample1; 2007; 354; 94
Sample1; 2008; 354; 24
Sample2; 1963; 1342; 0
Sample2; 1964; 1342; 7
Sample2; 1965; 1342; 24
Sample3; 2001; 859; 0
Sample3; 2002; 859; 90
Sample3; 2003; 859; 93
Sample3; 2004; 859; 53
Sample3; 2005; 859; 98
Thanks for your help!
Answer 0 (score: 2)
Assuming your data.frame is df, I would do this in base R:
df <- do.call(rbind, lapply(split(df, df$NAME), function(x) {
  x <- rbind(x[1, ], x)             # duplicate the first row of the group
  x[1, "VALUE"] <- 0
  x[1, "YEAR"] <- x[1, "YEAR"] - 1
  x
}))
If needed, you can reset the row names to plain sequential numbering:
rownames(df) <- seq_len(nrow(df))
df
# NAME YEAR ID VALUE
#1 Sample1 1997 354 0
#2 Sample1 1998 354 45
#3 Sample1 1999 354 23
#4 Sample1 2000 354 66
#5 Sample1 2001 354 98
#6 Sample1 2002 354 36
#7 Sample1 2003 354 59
#8 Sample1 2004 354 64
#9 Sample1 2005 354 23
#10 Sample1 2006 354 69
#11 Sample1 2007 354 94
#12 Sample1 2008 354 24
#13 Sample2 1963 1342 0
#14 Sample2 1964 1342 7
#15 Sample2 1965 1342 24
#16 Sample3 2001 859 0
#17 Sample3 2002 859 90
#18 Sample3 2003 859 93
#19 Sample3 2004 859 53
#20 Sample3 2005 859 98
Explanation of the steps that are simply combined in the code above:
# split by sample
lst <- split(df, df$NAME)
# add the first row to each sample
lst <- lapply(lst, function(x) rbind(x[1,], x))
# change the YEAR and VALUE entries in each first row
lst <- lapply(lst, function(x) {x[1,"VALUE"] <- 0; x[1, "YEAR"] <- x[1, "YEAR"] -1; return(x)})
# rbind back to a data frame
df <- do.call(rbind, lst)
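Since the question starts from a semicolon-separated file, here is a minimal end-to-end sketch of the same split / rbind idea. It is only an illustration: the inline `input_text` stands in for the real file, and the helper name `add_zero_row` is made up for this example.

```r
# Small stand-in for the real input file (a subset of the question's data).
input_text <- "NAME; YEAR; ID; VALUE
Sample1; 1998; 354; 45
Sample1; 1999; 354; 23
Sample2; 1964; 1342; 7
Sample2; 1965; 1342; 24"

# strip.white = TRUE drops the blanks that follow each semicolon.
df <- read.table(text = input_text, header = TRUE, sep = ";",
                 strip.white = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: prepend a copy of the group's first row
# with VALUE set to 0 and YEAR moved back by one.
add_zero_row <- function(x) {
  first <- x[1, ]
  first$VALUE <- 0
  first$YEAR  <- first$YEAR - 1
  rbind(first, x)
}

out <- do.call(rbind, lapply(split(df, df$NAME), add_zero_row))
rownames(out) <- NULL   # reset row names to 1..n

# Write back in a semicolon-separated layout; file = "" prints to the console.
write.table(out, file = "", sep = "; ", quote = FALSE, row.names = FALSE)
```

To write to disk instead, replace `file = ""` with a path of your choice. Note that this takes the first row of each group as it appears in the file, so it assumes the rows are already sorted by year within each sample.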
Answer 1 (score: 1)
Read in your data:
d <- read.table(text = "NAME; YEAR; ID; VALUE
Sample1; 1998; 354; 45
Sample1; 1999; 354; 23
Sample1; 2000; 354; 66
Sample1; 2001; 354; 98
Sample1; 2002; 354; 36
Sample1; 2003; 354; 59
Sample1; 2004; 354; 64
Sample1; 2005; 354; 23
Sample1; 2006; 354; 69
Sample1; 2007; 354; 94
Sample1; 2008; 354; 24
Sample2; 1964; 1342; 7
Sample2; 1965; 1342; 24
Sample3; 2002; 859; 90
Sample3; 2003; 859; 93
Sample3; 2004; 859; 53
Sample3; 2005; 859; 98 ", header = TRUE, sep = ";", stringsAsFactors = FALSE)
For whatever reason, I felt like doing this with a for loop:
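tmp <- as.factor(d$NAME)
# preallocate the result: one extra row per sample
d2 <- setNames(data.frame(matrix(nrow = nrow(d) + nlevels(tmp), ncol = ncol(d))),
               names(d))
s <- split(d, d$NAME)
j <- 1  # next free row in d2
for (i in seq_len(nlevels(tmp))) {
  # list() rather than c() keeps YEAR, ID and VALUE numeric
  d2[j, ] <- list(s[[i]][1, 1], s[[i]][1, 2] - 1, s[[i]][1, 3], 0)
  d2[(j + 1):(j + nrow(s[[i]])), ] <- s[[i]]
  j <- j + nrow(s[[i]]) + 1
}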
Result:
NAME YEAR ID VALUE
1 Sample1 1997 354 0
2 Sample1 1998 354 45
3 Sample1 1999 354 23
4 Sample1 2000 354 66
5 Sample1 2001 354 98
6 Sample1 2002 354 36
7 Sample1 2003 354 59
8 Sample1 2004 354 64
9 Sample1 2005 354 23
10 Sample1 2006 354 69
11 Sample1 2007 354 94
12 Sample1 2008 354 24
13 Sample2 1963 1342 0
14 Sample2 1964 1342 7
15 Sample2 1965 1342 24
16 Sample3 2001 859 0
17 Sample3 2002 859 90
18 Sample3 2003 859 93
19 Sample3 2004 859 53
20 Sample3 2005 859 98
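For comparison, the loop can probably be avoided in base R as well. The following is a sketch of one alternative, not part of the approach above: it picks the first row per NAME with `duplicated()`, adjusts it, and sorts the new rows into place. The small data frame is a cut-down version of the question's data, used only to keep the example self-contained.

```r
d <- data.frame(
  NAME  = c("Sample1", "Sample1", "Sample2", "Sample2", "Sample3"),
  YEAR  = c(1998L, 1999L, 1964L, 1965L, 2002L),
  ID    = c(354L, 354L, 1342L, 1342L, 859L),
  VALUE = c(45L, 23L, 7L, 24L, 90L),
  stringsAsFactors = FALSE
)

# Sort first so that the first row of each NAME is its earliest year.
d <- d[order(d$NAME, d$YEAR), ]

# One row per NAME: the first occurrence after sorting.
first <- d[!duplicated(d$NAME), ]
first$YEAR  <- first$YEAR - 1L
first$VALUE <- 0L

# Append the new rows, then sort them into position.
d2 <- rbind(d, first)
d2 <- d2[order(d2$NAME, d2$YEAR), ]
rownames(d2) <- NULL
```

Because everything is vectorised, this should scale reasonably to the roughly 80000 samples mentioned in the question, although I have not benchmarked it.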
Answer 2 (score: 1)
You could try data.table for larger datasets:
library(data.table)
DT <- data.table(dat)  # dat: the input data.frame
sub <- unique(DT, by="NAME")[, c("YEAR", "VALUE") := list(YEAR-1, 0)]
rbindlist(list(DT, sub))[order(NAME, YEAR)]
# NAME YEAR ID VALUE
# 1: Sample1 1997 354 0
# 2: Sample1 1998 354 45
# 3: Sample1 1999 354 23
# 4: Sample1 2000 354 66
# 5: Sample1 2001 354 98
# 6: Sample1 2002 354 36
# 7: Sample1 2003 354 59
# 8: Sample1 2004 354 64
# 9: Sample1 2005 354 23
#10: Sample1 2006 354 69
#11: Sample1 2007 354 94
#12: Sample1 2008 354 24
#13: Sample2 1963 1342 0
#14: Sample2 1964 1342 7
#15: Sample2 1965 1342 24
#16: Sample3 2001 859 0
#17: Sample3 2002 859 90
#18: Sample3 2003 859 93
#19: Sample3 2004 859 53
#20: Sample3 2005 859 98
As @Arun suggested, more compact code would be:
DT[, list(YEAR=c(YEAR[1L]-1L, YEAR), VALUE=c(0,VALUE)), by=list(NAME,ID)]
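A self-contained sketch of the compact version, on a cut-down copy of the data. Two things here are my own assumptions rather than part of the suggestion: the `setorder()` call, which makes sure `YEAR[1L]` really is the earliest year in each group, and the `setcolorder()` call, which is needed because the grouping columns (NAME, ID) come first in the result.

```r
library(data.table)

DT <- data.table(
  NAME  = c("Sample1", "Sample1", "Sample2", "Sample2"),
  YEAR  = c(1998L, 1999L, 1964L, 1965L),
  ID    = c(354L, 354L, 1342L, 1342L),
  VALUE = c(45, 23, 7, 24)
)

# Sort by reference so the first row of each group is its earliest year.
setorder(DT, NAME, YEAR)

# Prepend (first year - 1, 0) to each group's YEAR and VALUE columns.
res <- DT[, list(YEAR = c(YEAR[1L] - 1L, YEAR), VALUE = c(0, VALUE)),
          by = list(NAME, ID)]

# Restore the original column order NAME, YEAR, ID, VALUE.
setcolorder(res, c("NAME", "YEAR", "ID", "VALUE"))
res
```

Grouping by both NAME and ID only works as intended when ID is constant within each NAME, as it is in the sample data.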