For the example data frame:
df <- structure(list(region = structure(1:8, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h"), class = "factor"), y.2012 = c(5.5,
NA, 4.7, 3.6, NA, NA, 4.6, NA), y.2013 = c(5.7, NA, NA, 3.8,
NA, 6.2, NA, NA), y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA
)), .Names = c("region", "y.2012", "y.2013", "y.2014"), class = "data.frame", row.names = c(NA,
-8L))
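I want to add an extra column that records the value from the most recent column that has one (the last non-NA value in each row). This is what I have so far (from this question):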
df$variable <- apply(df[-1], 1, function(x) {
i1 <- tail(x[!is.na(x)],1)
if(length(i1)>0) i1 else NA})
df$variable
In addition, I would like to add (as another column) the year that the value in `variable` comes from.
Can anyone help me with this?
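Answer 0 (score: 4)
You can achieve this as follows (`df1` here is presumably the data frame from the question, called `df` there):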
df1$variable <- apply(df1[,-1], 1, function(x) names(x)[!is.na(x)][sum(!is.na(x))])
which gives:
> df1
  region y.2012 y.2013 y.2014 variable
1      a    5.5    5.7     NA   y.2013
2      b     NA     NA    6.3   y.2014
3      c    4.7     NA     NA   y.2012
4      d    3.6    3.8    4.1   y.2014
5      e     NA     NA    5.1   y.2014
6      f     NA    6.2     NA   y.2013
7      g    4.6     NA     NA   y.2012
8      h     NA     NA     NA
The empty cell (the row where every year is NA) can then be replaced with NA:
df1[df1$variable=='character(0)','variable'] <- NA
which gives:
> df1
  region y.2012 y.2013 y.2014 variable
1      a    5.5    5.7     NA   y.2013
2      b     NA     NA    6.3   y.2014
3      c    4.7     NA     NA   y.2012
4      d    3.6    3.8    4.1   y.2014
5      e     NA     NA    5.1   y.2014
6      f     NA    6.2     NA   y.2013
7      g    4.6     NA     NA   y.2012
8      h     NA     NA     NA       NA
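To get both things the question asks for (the latest value and the year it belongs to) without the `character(0)` cleanup step, one option is to find the position of the last non-NA column first and derive everything from that. This is only a sketch in base R: the names `m` and `last_col` and the new `year` column are my own choices for illustration, and it assumes the year columns are ordered oldest to newest and all follow the `y.<year>` naming pattern.

```r
# Example data from the question (region kept as plain character here)
df <- data.frame(
  region = letters[1:8],
  y.2012 = c(5.5, NA, 4.7, 3.6, NA, NA, 4.6, NA),
  y.2013 = c(5.7, NA, NA, 3.8, NA, 6.2, NA, NA),
  y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA)
)

# Numeric matrix holding only the year columns
m <- as.matrix(df[-1])

# Position of the last non-NA column in each row (NA if the whole row is NA)
last_col <- apply(!is.na(m), 1, function(r) {
  if (any(r)) max(which(r)) else NA_integer_
})

# Matrix indexing with one (row, column) pair per region;
# a pair containing NA yields NA in the result
df$variable <- m[cbind(seq_len(nrow(m)), last_col)]

# Year taken from the matching column name, e.g. "y.2013" -> 2013
df$year <- as.integer(sub("^y\\.", "", colnames(m)[last_col]))

df
```

Because the all-NA row is handled inside the helper function, `variable` and `year` come out as ordinary numeric and integer vectors rather than a list column.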
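As noted in the comments, it is probably better to reshape your data to long format first and then check which year holds the last value (starting again from `df1` without the added `variable` column). Using the data.table package: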
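library(data.table)
# setDT() converts df1 to a data.table by reference; melt() reshapes it to long format
df2 <- melt(setDT(df1), id.vars='region', variable.name = 'year')
# strip the "y." prefix, then store the last year with a non-NA value for each region
df2[, year := as.integer(gsub('^y\\.','',year))
    ][, var := tail(year[!is.na(value)],1), by = region]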
which gives:
> df2
    region year value  var
 1:      a 2012   5.5 2013
 2:      b 2012    NA 2014
 3:      c 2012   4.7 2012
 4:      d 2012   3.6 2014
 5:      e 2012    NA 2014
 6:      f 2012    NA 2013
 7:      g 2012   4.6 2012
 8:      h 2012    NA   NA
 9:      a 2013   5.7 2013
10:      b 2013    NA 2014
11:      c 2013    NA 2012
12:      d 2013   3.8 2014
13:      e 2013    NA 2014
14:      f 2013   6.2 2013
15:      g 2013    NA 2012
16:      h 2013    NA   NA
17:      a 2014    NA 2013
18:      b 2014   6.3 2014
19:      c 2014    NA 2012
20:      d 2014   4.1 2014
21:      e 2014   5.1 2014
22:      f 2014    NA 2013
23:      g 2014    NA 2012
24:      h 2014    NA   NA
A similar solution with dplyr & tidyr:
library(dplyr)
library(tidyr)
df2 <- df1 %>%
  gather(year, value, -1) %>%                            # wide to long
  mutate(year = as.integer(gsub('^y\\.','',year))) %>%   # "y.2012" -> 2012
  group_by(region) %>%
  # last year with a non-NA value; NA when a region has no values at all
  mutate(var = as.integer(ifelse(all(is.na(value)), NA, tail(year[!is.na(value)],1))))
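In current tidyr releases `gather()` is superseded by `pivot_longer()`, so the same reshape can also be written as below. Treat this as a hedged sketch rather than part of the original answer: it assumes tidyr 1.1.0 or later (for `names_transform`), and the names `long`, `latest`, `last_year` and `last_value` are made up for illustration. It also collapses the result to one row per region with `summarise()` instead of repeating the year on every row.

```r
library(dplyr)
library(tidyr)

# Example data from the question (region kept as plain character here)
df <- data.frame(
  region = letters[1:8],
  y.2012 = c(5.5, NA, 4.7, 3.6, NA, NA, 4.6, NA),
  y.2013 = c(5.7, NA, NA, 3.8, NA, 6.2, NA, NA),
  y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA)
)

# Wide to long; the "y." prefix is dropped and the year is stored as integer
long <- df %>%
  pivot_longer(
    cols = -region,
    names_to = "year",
    names_prefix = "y\\.",
    names_transform = list(year = as.integer),
    values_to = "value"
  )

# One row per region: last year with a non-NA value, and that value
latest <- long %>%
  group_by(region) %>%
  summarise(
    last_year  = if (all(is.na(value))) NA_integer_ else max(year[!is.na(value)]),
    # match() returns NA for an all-NA region, so last_value is NA there too
    last_value = value[match(last_year, year)]
  )

latest
```

If the per-region summary is needed next to the original wide data, `left_join(df, latest, by = "region")` should attach it.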
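Answer 1 (score: 0)
You can use the `melt` function from the reshape2 package to convert to long format, and then use `str_replace` from the stringr package to get the year without the "y." prefix. See below. First, convert to long format: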
library(reshape2)
df2 <- reshape2::melt(df,
id.vars="region",
variable.name="yearStr")
df2
Output:
  region yearStr value
1      a  y.2012   5.5
2      b  y.2012    NA
3      c  y.2012   4.7
4      d  y.2012   3.6
...
Then, extract the year:
df2$year <- as.numeric(stringr::str_replace(df2$yearStr, "^y\\.", ""))
df2
Output:
  region yearStr value year
1      a  y.2012   5.5 2012
2      b  y.2012    NA 2012
3      c  y.2012   4.7 2012
4      d  y.2012   3.6 2012
...
Use the year column (with dplyr) to get the row for the most recent year in each region:
library(dplyr)
regions <- group_by(df2, region)
df3 <- filter(regions[!is.na(regions$value),], min_rank(desc(year)) <= 1)
as.data.frame(df3)
Output:
  region yearStr value year
1      c  y.2012   4.7 2012
2      g  y.2012   4.6 2012
3      a  y.2013   5.7 2013
4      f  y.2013   6.2 2013
5      b  y.2014   6.3 2014
6      d  y.2014   4.1 2014
7      e  y.2014   5.1 2014
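Certainly not as concise as @Procrastinatus Maximus's answer, but the intermediate results may be useful for plotting or other analysis.
Edited: added the dplyr step to show only the most recent row of data for each region.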
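One thing to be aware of: the filtered output above has seven rows, because region h has no non-NA values and therefore drops out. If every region should stay in the result, the latest rows can be joined back onto the full list of regions. A rough base-R sketch of that idea follows; the names `long`, `obs`, `latest` and `result` are placeholders of mine, not from the answer.

```r
# Example data from the question (region kept as plain character here)
df <- data.frame(
  region = letters[1:8],
  y.2012 = c(5.5, NA, 4.7, 3.6, NA, NA, 4.6, NA),
  y.2013 = c(5.7, NA, NA, 3.8, NA, 6.2, NA, NA),
  y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA)
)

# Long format: one row per region and year (columns are stacked in order)
long <- data.frame(
  region = rep(df$region, times = 3),
  year   = rep(2012:2014, each = nrow(df)),
  value  = unlist(df[-1], use.names = FALSE)
)

# Keep observed values only, sort, and take the last year within each region
obs    <- long[!is.na(long$value), ]
obs    <- obs[order(obs$region, obs$year), ]
latest <- obs[!duplicated(obs$region, fromLast = TRUE), ]

# Left join onto all regions so that "h" is kept, with NA year and value
result <- merge(data.frame(region = df$region), latest,
                by = "region", all.x = TRUE)
result
```

Whether a placeholder row for an empty region is wanted depends on what the result feeds into; for plotting, dropping it as the dplyr filter does may be just as reasonable.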