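I have a data frame that looks like the one below, and I am trying to compute the cumulative sum of the VALUE column. The input file is also available here: https://dl.dropboxusercontent.com/u/16277659/input.csv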
df <- read.csv("input.csv", sep=";", header=TRUE)
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1880; 1879; 14
SAMPLE1; 253; 1881; 1880; -10
SAMPLE1; 253; 1882; 1881; 4
SAMPLE1; 253; 1883; 1882; 10
SAMPLE1; 253; 1884; 1883; 10
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1889; 1888; 11
SAMPLE1; 253; 1890; 1889; 12
SAMPLE1; 253; 1911; 1910; -16
SAMPLE1; 253; 1913; 1911; -11
SAMPLE1; 253; 1914; 1913; -8
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
SAMPLE2; 261; 1996; 1995; -21
SAMPLE2; 261; 1997; 1996; -50
SAMPLE2; 261; 1998; 1997; -60
SAMPLE2; 261; 2005; 2004; -34
SAMPLE2; 261; 2006; 2005; -23
SAMPLE2; 261; 2007; 2006; -19
SAMPLE2; 261; 2008; 2007; -29
SAMPLE2; 261; 2009; 2008; -89
SAMPLE2; 261; 2013; 2009; -14
SAMPLE2; 261; 2014; 2013; -16
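The end product I am aiming for is one plot per SAMPLE, with SURVEY_YEAR on the x-axis and the cumulative sum of VALUE (CUMSUM, computed later) on the y-axis. Here is my code so far to tidy the data: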
# Filter out all values with less than 3 measurements by group (in this case does nothing, but is important with the rest of my data)
df <-read.csv("input.csv", sep=";", header=TRUE)
rowsn <- with(df,by(VALUE,ID,function(xx)sum(!is.na(xx))))
names(which(rowsn>=3))
dat <- df[df$ID %in% names(which(rowsn>=3)),]
# add a starting row to each group (split by ID) with VALUE = 0, as the baseline for the cumsum function
dat <- do.call(rbind, lapply(split(dat, dat$ID), function(x) {
  x <- rbind(x[1, ], x)                           # duplicate the first row of the group
  x[1, "VALUE"] <- 0                              # the baseline contributes nothing to the sum
  x[1, "SURVEY_YEAR"] <- x[1, "SURVEY_YEAR"] - 1  # place it one year before the first survey
  x
}))
rownames(dat) <- seq_len(nrow(dat))
# write dat to csv file for inspection
write.table(dat, "dat.csv", sep=";", row.names=FALSE)
This produces the following data frame, which is the starting point for computing the cumulative sum of the VALUE column.
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1879; 1879; 0
SAMPLE1; 253; 1880; 1879; 14
SAMPLE1; 253; 1881; 1880; -10
SAMPLE1; 253; 1882; 1881; 4
SAMPLE1; 253; 1883; 1882; 10
SAMPLE1; 253; 1884; 1883; 10
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1889; 1888; 11
SAMPLE1; 253; 1890; 1889; 12
SAMPLE1; 253; 1911; 1910; -16
SAMPLE1; 253; 1913; 1911; -11
SAMPLE1; 253; 1914; 1913; -8
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
SAMPLE2; 261; 1996; 1995; -21
SAMPLE2; 261; 1997; 1996; -50
SAMPLE2; 261; 1998; 1997; -60
SAMPLE2; 261; 2005; 2004; -34
SAMPLE2; 261; 2006; 2005; -23
SAMPLE2; 261; 2007; 2006; -19
SAMPLE2; 261; 2008; 2007; -29
SAMPLE2; 261; 2009; 2008; -89
SAMPLE2; 261; 2013; 2009; -14
SAMPLE2; 261; 2014; 2013; -16
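Now the problem: I want to compute the cumulative sum of the VALUE column for every year. As you can see, there are gaps between some years (for example, between 1890 and 1911 for SAMPLE1, and between 1998 and 2005 for SAMPLE2). I want to fill those gaps with NA for each missing year so that I can plot with type='b' (points and lines) without connecting the points across a gap. Importantly, when several NA values follow one another, the last NA in the CUMSUM column should be replaced by the last numeric value before the gap.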
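One possible way to pad the missing years, sketched in base R on a few SAMPLE1 rows taken from the table above: build the full year sequence and left-join the data onto it with `merge(..., all.x = TRUE)`. The names `x`, `all_years`, and `filled` are made up for this illustration, and the sketch has not been run against the full input file.

```r
# A few SAMPLE1 rows from the question, with a gap between 1885 and 1889.
x <- data.frame(
  NAME        = "SAMPLE1",
  ID          = 253,
  SURVEY_YEAR = c(1884, 1885, 1889, 1890),
  VALUE       = c(10, 12, 11, 12),
  stringsAsFactors = FALSE
)

# Every year between the first and last survey of this sample.
all_years <- data.frame(
  SURVEY_YEAR = seq(min(x$SURVEY_YEAR), max(x$SURVEY_YEAR))
)

# Left join: years without a survey get NA in all other columns.
filled <- merge(all_years, x, by = "SURVEY_YEAR", all.x = TRUE)

# Restore the identifying columns on the padded rows.
filled$NAME <- x$NAME[1]
filled$ID   <- x$ID[1]

print(filled)
```

Base R `plot(..., type = "b")` does not draw a line segment to or from a point whose value is NA, so the padded rows are what produce the breaks in the plotted line.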
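Normally the difference between REFERENCE_YEAR and SURVEY_YEAR is 1 (for example, 1880 to 1881 in the first rows of SAMPLE1), but in some cases the period is longer (for example, 1911 to 1913 in SAMPLE1 and 2009 to 2013 in SAMPLE2). In that case the cumulative sum should be applied only once, and the value should stay constant over the whole period (in the plot this gives a connected straight line).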
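The "apply only once" rule could be sketched like this in base R, using the three SAMPLE1 rows around the 1911 to 1913 period. Each measurement is expanded to one row per year it covers, but only the first of those years adds to the running total. The sketch assumes SURVEY_YEAR is always greater than REFERENCE_YEAR and that periods do not overlap; `span`, `expanded`, and `first_of_period` are illustrative names.

```r
# Three SAMPLE1 rows from the question; the middle one covers two years.
x <- data.frame(
  SURVEY_YEAR    = c(1911, 1913, 1914),
  REFERENCE_YEAR = c(1910, 1911, 1913),
  VALUE          = c(-16, -11, -8)
)

# Number of years each measurement covers (assumed to be at least 1).
span <- x$SURVEY_YEAR - x$REFERENCE_YEAR

# One row per covered year: REFERENCE_YEAR + 1 up to SURVEY_YEAR.
expanded <- data.frame(
  SURVEY_YEAR = unlist(Map(seq, x$REFERENCE_YEAR + 1, x$SURVEY_YEAR)),
  VALUE       = rep(x$VALUE, times = span)
)
expanded$REFERENCE_YEAR <- expanded$SURVEY_YEAR - 1

# Only the first year of each period contributes to the running total.
first_of_period <- unlist(lapply(span, function(n) seq_len(n) == 1))
expanded$CUMSUM <- cumsum(ifelse(first_of_period, expanded$VALUE, 0))

print(expanded)
```

Here the running total starts at 0 rather than at the 63 carried over from earlier years, so the values are offset from the desired table, but the pattern is the same: 1912 and 1913 share one total.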
It is hard to explain everything in detail, so it may be easier if I give an example of what the result should look like:
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE; CUMSUM
SAMPLE1; 253; 1879; 1879; 0; 0
SAMPLE1; 253; 1880; 1879; 14; 14
SAMPLE1; 253; 1881; 1880; -10; 4
SAMPLE1; 253; 1882; 1881; 4; 8
SAMPLE1; 253; 1883; 1882; 10; 18
SAMPLE1; 253; 1884; 1883; 10; 28
SAMPLE1; 253; 1885; 1884; 12; 40
SAMPLE1; 253; 1886; 1885; NA; NA
SAMPLE1; 253; 1887; 1886; NA; NA
SAMPLE1; 253; 1888; 1887; NA; 40
SAMPLE1; 253; 1889; 1888; 11; 51
SAMPLE1; 253; 1890; 1889; 12; 63
SAMPLE1; 253; 1891; 1890; NA; NA
SAMPLE1; 253; 1892; 1891; NA; NA
SAMPLE1; 253; 1893; 1892; NA; NA
SAMPLE1; 253; 1894; 1893; NA; NA
SAMPLE1; 253; 1895; 1894; NA; NA
SAMPLE1; 253; 1896; 1895; NA; NA
SAMPLE1; 253; 1897; 1896; NA; NA
SAMPLE1; 253; 1898; 1897; NA; NA
SAMPLE1; 253; 1899; 1898; NA; NA
SAMPLE1; 253; 1900; 1899; NA; NA
SAMPLE1; 253; 1901; 1900; NA; NA
SAMPLE1; 253; 1902; 1901; NA; NA
SAMPLE1; 253; 1903; 1902; NA; NA
SAMPLE1; 253; 1904; 1903; NA; NA
SAMPLE1; 253; 1905; 1904; NA; NA
SAMPLE1; 253; 1906; 1905; NA; NA
SAMPLE1; 253; 1907; 1906; NA; NA
SAMPLE1; 253; 1908; 1907; NA; NA
SAMPLE1; 253; 1909; 1908; NA; NA
SAMPLE1; 253; 1910; 1909; NA; 63
SAMPLE1; 253; 1911; 1910; -16; 47
SAMPLE1; 253; 1912; 1911; -11; 36
SAMPLE1; 253; 1913; 1912; -11; 36
SAMPLE1; 253; 1914; 1913; -8; 28
SAMPLE2; 253; 1991; 1991; 0; 0
SAMPLE2; 253; 1992; 1991; -19; -19
SAMPLE2; 253; 1993; 1992; -58; -77
SAMPLE2; 253; 1994; 1993; -58; -135
SAMPLE2; 253; 1995; 1994; -40; -175
SAMPLE2; 253; 1996; 1995; -21; -196
SAMPLE2; 253; 1997; 1996; -50; -246
SAMPLE2; 253; 1998; 1997; -60; -306
SAMPLE2; 253; 1999; 1998; NA; NA
SAMPLE2; 253; 2000; 1999; NA; NA
SAMPLE2; 253; 2001; 2000; NA; NA
SAMPLE2; 253; 2002; 2001; NA; NA
SAMPLE2; 253; 2003; 2002; NA; NA
SAMPLE2; 253; 2004; 2003; NA; -306
SAMPLE2; 253; 2005; 2004; -34; -340
SAMPLE2; 253; 2006; 2005; -23; -363
SAMPLE2; 253; 2007; 2006; -19; -382
SAMPLE2; 253; 2008; 2007; -29; -411
SAMPLE2; 253; 2009; 2008; -89; -500
SAMPLE2; 253; 2010; 2009; -14; -514
SAMPLE2; 253; 2011; 2010; -14; -514
SAMPLE2; 253; 2012; 2011; -14; -514
SAMPLE2; 253; 2013; 2012; -14; -514
SAMPLE2; 253; 2014; 2013; -16; -530
Any help with this rather complicated case is much appreciated! Thanks!
Answer 0 (score: 0)
BIG EDIT: posted the code, with the correct library call added.
library(dplyr)
df = read.csv("input.csv", sep=";", stringsAsFactors=FALSE)
#find min/max year for each SAMPLE
df_minmax = df %>%
group_by(NAME) %>%
summarise(min_year = min(SURVEY_YEAR),
max_year = max(SURVEY_YEAR))
#create an empty dataframe with what we want
df2 = data.frame(NAME = "",
ID = 0,
SURVEY_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR),
REFERENCE_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR) - 1,
VALUE = NA, stringsAsFactors=FALSE)
#fill in the NAMES dataframe - there's probably a better way to do this
for(i in 1:nrow(df_minmax)) {
min_year = df_minmax[i, ]$min_year
max_year = df_minmax[i, ]$max_year
df2[df2$SURVEY_YEAR %in% min_year:max_year, ]$NAME = df_minmax[i, ]$NAME
}
#fill in the values
#this line is a bit dangerous -- it relies on df and df2 having the same relative ordering
#(and on each SURVEY_YEAR appearing in only one sample), so don't reorder df or df2 before this line.
df2[df2$SURVEY_YEAR %in% df$SURVEY_YEAR, ]$VALUE = df$VALUE
#in this example there is a long period between sample1 and sample2 we can filter those out
df2 = df2 %>% filter(NAME != "")
#Now we can do all the cumulative stuff
#for purposes of cumulative sums, set NA to 0
temp = df2$VALUE
df2[is.na(df2)] = 0
df2 = df2 %>% group_by(NAME) %>% mutate(csum = cumsum(VALUE))
#get back the NA values -- in case the NA values are useful to you
df2$VALUE = temp
Here is `head(df2, 20)`:
NAME ID SURVEY_YEAR REFERENCE_YEAR VALUE csum
1 SAMPLE1 0 1880 1879 14 14
2 SAMPLE1 0 1881 1880 -10 4
3 SAMPLE1 0 1882 1881 4 8
4 SAMPLE1 0 1883 1882 10 18
5 SAMPLE1 0 1884 1883 10 28
6 SAMPLE1 0 1885 1884 12 40
7 SAMPLE1 0 1886 1885 NA 40
8 SAMPLE1 0 1887 1886 NA 40
9 SAMPLE1 0 1888 1887 NA 40
10 SAMPLE1 0 1889 1888 11 51
11 SAMPLE1 0 1890 1889 12 63
12 SAMPLE1 0 1891 1890 NA 63
13 SAMPLE1 0 1892 1891 NA 63
14 SAMPLE1 0 1893 1892 NA 63
15 SAMPLE1 0 1894 1893 NA 63
16 SAMPLE1 0 1895 1894 NA 63
17 SAMPLE1 0 1896 1895 NA 63
18 SAMPLE1 0 1897 1896 NA 63
19 SAMPLE1 0 1898 1897 NA 63
20 SAMPLE1 0 1899 1898 NA 63
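To summarise the steps above: find the first and last year for each sample, build an empty data frame covering every year, fill in the names and values, drop the years that belong to no sample, and compute the cumulative sum per sample.
The for loop is a bit crude; I hope nobody minds.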
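The `csum` column in the output above carries the running total through every gap year, whereas the question asks for NA in the gap years except the last one before measurements resume. A possible post-processing step, sketched in base R on rows shaped like the SAMPLE1 output for 1884 to 1890, is shown below. `mask_gaps` is a hypothetical helper written for this illustration, not part of dplyr or base R.

```r
# Rows shaped like the df2 output above (SAMPLE1, 1884 to 1890).
df2 <- data.frame(
  NAME        = "SAMPLE1",
  SURVEY_YEAR = 1884:1890,
  VALUE       = c(10, 12, NA, NA, NA, 11, 12),
  csum        = c(28, 40, 40, 40, 40, 51, 63),
  stringsAsFactors = FALSE
)

# Hypothetical helper: blank out csum in gap years, except for the last
# gap year directly before a measured year.
mask_gaps <- function(value, csum) {
  gap <- is.na(value)
  # TRUE where the following row holds a measurement
  next_known <- c(!gap[-1], FALSE)
  ifelse(gap & !next_known, NA_real_, csum)
}

# Apply per sample so that one sample's last row never looks at the
# first row of the next sample.
for (nm in unique(df2$NAME)) {
  rows <- df2$NAME == nm
  df2$csum[rows] <- mask_gaps(df2$VALUE[rows], df2$csum[rows])
}

print(df2)
```

With this masking, 1886 and 1887 become NA while 1888 keeps the carried-over 40, so a `type = "b"` plot would break after 1885 and resume with a segment from 1888 to 1889. The multi-year periods (such as 1911 to 1913) are not handled by this step.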