R: Convert yearly indicator variables into a single year variable

Time: 2018-08-01 16:35:00

Tags: r

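I have a data frame that contains a set of columns acting as indicator variables for a given year. For example, the column "d80" is 1 for rows whose year is 1980 and 0 otherwise.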

for(i in names(df)[31:35]){
  print(c(i, df[[i]][1:5]))  # R indexing starts at 1; index 0 is silently dropped
}

[1] "d80" "1"   "0"   "0"   "0"   "0"  
[1] "d81" "0"   "1"   "0"   "0"   "0"  
[1] "d82" "0"   "0"   "1"   "0"   "0"  
[1] "d83" "0"   "0"   "0"   "1"   "0"  
[1] "d84" "0"   "0"   "0"   "0"   "1"  

Another way to show the data:

head(data$d80)
[1] 1 0 0 0 0 0

head(data$d81)
[1] 0 1 0 0 0 0

A third way:

> x = df[1:3, 31:55]
> dput(x)
structure(list(d80 = c(1L, 0L, 0L), d81 = c(0L, 1L, 0L), d82 = c(0L, 
0L, 1L), d83 = c(0L, 0L, 0L), d84 = c(0L, 0L, 0L), d85 = c(0L, 
0L, 0L), d86 = c(0L, 0L, 0L), d87 = c(0L, 0L, 0L), d88 = c(0L, 
0L, 0L), d89 = c(0L, 0L, 0L), d90 = c(0L, 0L, 0L), d91 = c(0L, 
0L, 0L), d92 = c(0L, 0L, 0L), d93 = c(0L, 0L, 0L), d94 = c(0L, 
0L, 0L), d95 = c(0L, 0L, 0L), d96 = c(0L, 0L, 0L), d97 = c(0L, 
0L, 0L), d98 = c(0L, 0L, 0L), d99 = c(0L, 0L, 0L), d00 = c(0L, 
0L, 0L), d01 = c(0L, 0L, 0L), d02 = c(0L, 0L, 0L), d03 = c(0L, 
0L, 0L), d04 = c(0L, 0L, 0L)), row.names = c("1", "2", "3"), class = "data.frame")

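My end goal is to compute the mean of a given column for each year, so I want to add a column whose value in each row is that row's year. In other words, I want to collapse the set of year indicator columns into a single column. For example, the data above would become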

80
81
82
83
84

What is the best way to do this? Thanks for your help!

1 Answer:

Answer 0: (score: 0)

Assuming the dataset is df, you can use the following approach:

library(tidyverse)

df %>%
  group_by(id = row_number()) %>%     # for every row number (row id)
  nest() %>%                          # nest data
  mutate(year = map(data, ~as.numeric(gsub("d", "", names(.)[.==1])))) %>%  # keep the column name of value 1, remove "d" and make the value numeric 
  unnest() %>%                        # unnest data
  select(-id)                         # remove row id


# # A tibble: 3 x 26
#    year   d80   d81   d82   d83   d84   d85   d86   d87   d88   d89   d90   d91   d92   d93
#   <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1    80     1     0     0     0     0     0     0     0     0     0     0     0     0     0
# 2    81     0     1     0     0     0     0     0     0     0     0     0     0     0     0
# 3    82     0     0     1     0     0     0     0     0     0     0     0     0     0     0
# # ... with 11 more variables: d94 <int>, d95 <int>, d96 <int>, d97 <int>, d98 <int>, d99 <int>,
# #   d00 <int>, d01 <int>, d02 <int>, d03 <int>, d04 <int>

The new column is called year, and it is placed at the beginning of the dataset.

Another approach is to do some reshaping and joining:

library(tidyverse)

# add a row id (useful for reshaping after)
df = df %>% mutate(id = row_number())

df %>%
  gather(year, value, -id) %>%   # reshape data
  filter(value == 1) %>%         # keep 1s
  mutate(year = as.numeric(gsub("d", "", year))) %>%  # update year value
  left_join(df, by="id") %>%     # join back original dataset
  select(-id, -value)            # remove unnecessary columns


#   year d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92 d93 d94 d95 d96 d97 d98 d99 d00 d01
# 1   80   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
# 2   81   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
# 3   82   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
#   d02 d03 d04
# 1   0   0   0
# 2   0   0   0
# 3   0   0   0
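Both approaches above yield two-digit years, so the columns d00 to d04 turn into 0 to 4 and sort before 80. If that matters for the per-year means, the two-digit values can be mapped to four-digit years. The following is only a minimal sketch: `to_full_year` is a hypothetical helper that is not part of the original answer, and the cutoff (80 and above means 19YY, anything lower means 20YY) is an assumption based on the d80 to d04 columns shown in the question.

```r
# Hypothetical helper (not from the original answer): map "dYY" column
# names to four-digit years.
# Assumption: YY >= 80 belongs to the 1900s, anything lower to the 2000s.
to_full_year <- function(col_names) {
  yy <- as.integer(sub("^d", "", col_names))  # strip the leading "d"
  ifelse(yy >= 80L, 1900L + yy, 2000L + yy)
}

full_years <- to_full_year(c("d80", "d99", "d00", "d04"))
print(full_years)
```

The same function could be applied in place of `as.numeric(gsub("d", "", year))` in the pipelines above.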

A base R solution would be

df$year = as.numeric(gsub("d", "", apply(df , 1, function(x) names(x)[x==1])))
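Note that the `apply` call scans every column, so it only works as intended when the data frame holds nothing but indicator columns; an extra column that happens to equal 1 (such as the id added earlier) would also be picked up. A possible variation, sketched here under assumptions, restricts the lookup to the indicator columns and uses `max.col()` instead of a row-wise `apply`. It also carries the question's stated goal through to the per-year mean. The toy data and the `value` column are hypothetical, and the sketch assumes exactly one indicator is set in each row.

```r
# Toy data: three indicator columns plus a hypothetical measurement column.
df <- data.frame(
  value = c(10, 20, 30, 40),
  d80 = c(1L, 0L, 0L, 1L),
  d81 = c(0L, 1L, 0L, 0L),
  d82 = c(0L, 0L, 1L, 0L)
)

# Keep only the indicator columns, so other columns that happen to equal 1
# (for example a row id) cannot be mistaken for a year.
ind_cols <- grep("^d[0-9]{2}$", names(df), value = TRUE)

# max.col() returns, for each row, the index of the column holding the
# maximum, i.e. the column containing the 1 when exactly one is set.
hit <- max.col(as.matrix(df[ind_cols]), ties.method = "first")
df$year <- as.numeric(sub("^d", "", ind_cols[hit]))

# Mean of the measurement column for each year.
year_means <- aggregate(value ~ year, data = df, FUN = mean)
print(year_means)
```

A row with no indicator set would be assigned the first indicator column under `ties.method = "first"`, so such rows would need to be checked separately (for example with `rowSums`).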