Add a column stating which year column the data came from

Time: 2016-05-22 15:02:54

Tags: r

For this example data frame:

df <- structure(list(region = structure(1:8, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h"), class = "factor"), y.2012 = c(5.5, 
NA, 4.7, 3.6, NA, NA, 4.6, NA), y.2013 = c(5.7, NA, NA, 3.8, 
NA, 6.2, NA, NA), y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA
)), .Names = c("region", "y.2012", "y.2013", "y.2014"), class = "data.frame", row.names = c(NA, 
-8L))

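I want to add an extra column that records the value from the most recent year column that has data. This is what I have so far (from this question):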

df$variable <- apply(df[-1], 1, function(x) {
  i1 <- tail(x[!is.na(x)],1)
  if(length(i1)>0) i1 else NA})
df$variable

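In addition, I would like to add (as another column) the year that the 'variable' data came from.

Can anyone help me?

2 answers:

Answer 0 (score: 4)

You can get the name of the last non-NA column in each row as follows (df1 holds the example data from the question):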

df1$variable <- apply(df1[,-1], 1, function(x) names(x)[!is.na(x)][sum(!is.na(x))])

which gives:

> df1
  region y.2012 y.2013 y.2014 variable
1      a    5.5    5.7     NA   y.2013
2      b     NA     NA    6.3   y.2014
3      c    4.7     NA     NA   y.2012
4      d    3.6    3.8    4.1   y.2014
5      e     NA     NA    5.1   y.2014
6      f     NA    6.2     NA   y.2013
7      g    4.6     NA     NA   y.2012
8      h     NA     NA     NA         

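For region h every year is NA, so the function returns character(0) and the cell prints as empty. You can replace it with NA: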

df1[df1$variable=='character(0)','variable'] <- NA

which gives:

> df1
  region y.2012 y.2013 y.2014 variable
1      a    5.5    5.7     NA   y.2013
2      b     NA     NA    6.3   y.2014
3      c    4.7     NA     NA   y.2012
4      d    3.6    3.8    4.1   y.2014
5      e     NA     NA    5.1   y.2014
6      f     NA    6.2     NA   y.2013
7      g    4.6     NA     NA   y.2012
8      h     NA     NA     NA       NA
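
The character(0) workaround can be avoided by returning NA explicitly for rows with no data. The following is a minimal base R sketch and not part of the original answer. The column names latest_value and latest_year are made up for illustration, and it assumes the year columns are ordered from oldest to newest.

```r
# Example data from the question (region kept as plain character here).
df <- data.frame(
  region = letters[1:8],
  y.2012 = c(5.5, NA, 4.7, 3.6, NA, NA, 4.6, NA),
  y.2013 = c(5.7, NA, NA, 3.8, NA, 6.2, NA, NA),
  y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA)
)

year_cols <- df[-1]

# Position of the last non-NA year column in each row; NA if the row has no data.
last_idx <- apply(!is.na(year_cols), 1, function(ok) {
  if (any(ok)) max(which(ok)) else NA_integer_
})

# Matrix indexing by (row, column) pairs; a pair containing NA yields NA.
df$latest_value <- as.matrix(year_cols)[cbind(seq_len(nrow(df)), last_idx)]

# Column name at that position, with the "y." prefix stripped.
df$latest_year <- as.integer(sub("^y\\.", "", names(year_cols)[last_idx]))
```

For the example data, latest_year comes out as 2013, 2014, 2012, 2014, 2014, 2013, 2012, NA, which matches the variable column shown above, and both new columns are ordinary atomic vectors rather than a list.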

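As mentioned in the comments, it is probably better to reshape your data into long format first and then check which year holds the last value. Using the data.table package: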

library(data.table)
df2 <- melt(setDT(df1), id.vars='region', variable.name = 'year')
df2[, year := as.integer(gsub('^y\\.', '', year))
    ][, var := tail(year[!is.na(value)],1), by = region]

which gives:

> df2
    region year value  var
 1:      a 2012   5.5 2013
 2:      b 2012    NA 2014
 3:      c 2012   4.7 2012
 4:      d 2012   3.6 2014
 5:      e 2012    NA 2014
 6:      f 2012    NA 2013
 7:      g 2012   4.6 2012
 8:      h 2012    NA   NA
 9:      a 2013   5.7 2013
10:      b 2013    NA 2014
11:      c 2013    NA 2012
12:      d 2013   3.8 2014
13:      e 2013    NA 2014
14:      f 2013   6.2 2013
15:      g 2013    NA 2012
16:      h 2013    NA   NA
17:      a 2014    NA 2013
18:      b 2014   6.3 2014
19:      c 2014    NA 2012
20:      d 2014   4.1 2014
21:      e 2014   5.1 2014
22:      f 2014    NA 2013
23:      g 2014    NA 2012
24:      h 2014    NA   NA

A similar solution with dplyr & tidyr:

library(dplyr)
library(tidyr)
df2 <- df1 %>%
  gather(year, value, -1) %>%
  mutate(year = as.integer(gsub('^y\\.', '', year))) %>%
  group_by(region) %>%
  # last year with a value in each region; NA when the region has no data
  mutate(var = if (all(is.na(value))) NA_integer_ else tail(year[!is.na(value)], 1))
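
gather() has since been superseded by pivot_longer() in tidyr. The sketch below is one possible way to write the same idea with the newer verbs and to join the result back onto the wide data, so that each region gets one year and one value. It is not from the original answer, it assumes tidyr >= 1.0.0 and dplyr >= 1.0.0, and the names latest and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question.
df <- data.frame(
  region = letters[1:8],
  y.2012 = c(5.5, NA, 4.7, 3.6, NA, NA, 4.6, NA),
  y.2013 = c(5.7, NA, NA, 3.8, NA, 6.2, NA, NA),
  y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA)
)

latest <- df %>%
  # Long format; strip the "y." prefix and drop rows whose value is NA.
  pivot_longer(-region, names_to = "year", names_prefix = "y\\.",
               values_to = "value", values_drop_na = TRUE) %>%
  mutate(year = as.integer(year)) %>%
  group_by(region) %>%
  # Keep the row with the largest year in each region.
  slice_max(year, n = 1) %>%
  ungroup()

# Left join so that regions without any data (h) are kept with NA.
result <- left_join(df, latest, by = "region")
```

Dropping the NA rows before grouping means no special case is needed for a region with no data; the left join reintroduces it with NA in both new columns.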

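Answer 1 (score: 0)

You can use the melt function from the reshape2 package to convert to long format, and then the str_replace function from the stringr package to get the year without the "y." prefix. See below. First, convert to long format: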

library(reshape2)
df2 <- reshape2::melt(df, 
                      id.vars="region",
                      variable.name="yearStr")
df2

Output:

   region yearStr value
1       a  y.2012   5.5
2       b  y.2012    NA
3       c  y.2012   4.7
4       d  y.2012   3.6
...

Then, extract the year:

df2$year <- as.numeric(stringr::str_replace(df2$yearStr, "^y\\.", ""))

df2

Output:

   region yearStr value year
1       a  y.2012   5.5 2012
2       b  y.2012    NA 2012
3       c  y.2012   4.7 2012
4       d  y.2012   3.6 2012
...

Use the year column (with dplyr) to get the row for the most recent year in each region:

library(dplyr)
regions <- group_by(df2, region)
df3 <- filter(regions[!is.na(regions$value),], min_rank(desc(year)) <= 1)
as.data.frame(df3)

Output:

  region yearStr value year
1      c  y.2012   4.7 2012
2      g  y.2012   4.6 2012
3      a  y.2013   5.7 2013
4      f  y.2013   6.2 2013
5      b  y.2014   6.3 2014
6      d  y.2014   4.1 2014
7      e  y.2014   5.1 2014
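
Note that df3 has no row for region h and is separate from the original wide data frame. If the goal is the two extra columns the question asked for, one option is a left join back onto df. A minimal base R sketch, with the latest rows typed in from the output above so that the block runs on its own; the names latest and result are made up for illustration:

```r
# Example data from the question.
df <- data.frame(
  region = letters[1:8],
  y.2012 = c(5.5, NA, 4.7, 3.6, NA, NA, 4.6, NA),
  y.2013 = c(5.7, NA, NA, 3.8, NA, 6.2, NA, NA),
  y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA)
)

# Most recent non-NA row per region, copied from the output above.
latest <- data.frame(
  region = c("c", "g", "a", "f", "b", "d", "e"),
  value  = c(4.7, 4.6, 5.7, 6.2, 6.3, 4.1, 5.1),
  year   = c(2012, 2012, 2013, 2013, 2014, 2014, 2014)
)

# all.x = TRUE keeps every region of df; region h gets NA for value and year.
result <- merge(df, latest, by = "region", all.x = TRUE)
```

In practice latest would be df3[c("region", "value", "year")] rather than a hand-typed data frame.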

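Certainly not as concise as @Procrastinatus Maximus's answer, but the intermediate results may be useful for plotting or other analysis.

Edited: added dplyr to show only the most recent data row for each region.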