Variables in one column, values in another -> goal: add a column for each variable

Asked: 2016-12-22 21:11:43

Tags: r data-structures dplyr plyr

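I think I am facing a (hopefully) small problem, but searching has not turned up anything helpful. I am having trouble with data I pulled through the OECD package. The dataset I get stores all variables in a single column. It is in long format, which is fine, but I would like each variable to have its own column. At the moment the dataset looks like this: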

[Image: screenshot of the dataset]

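As you can see, the column "VAR" contains several variables: "B11", "B12", and so on, 11 variables in total. All variables are measured for many countries (column "COU"). What I would like to do is add new columns to the dataset, one for each variable currently stored in "VAR", each holding the corresponding values from the "obsValue" column.

That way I could see the value of B11 for, say, Afghanistan in 1999 and another for 2000, with the 1999 value of B12 in the same row as the 1999 value of B11, and so on. I hope my goal is clear; if not, please don't hesitate to ask.

Here is the code to reproduce the head of the dataset: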

dput(head(MIG,20)) 

structure(list(CO2 = c("AFG", "AFG", "AFG", "AFG", "AFG", "AFG", 
"AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", 
"AFG", "AFG", "AFG", "AFG", "AFG"), VAR = c("B11", "B11", "B11", 
"B11", "B11", "B11", "B11", "B11", "B11", "B11", "B11", "B11", 
"B11", "B11", "B11", "B11", "B12", "B12", "B12", "B12"), GEN = c("WMN", 
"WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", 
"WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", 
"WMN"), COU = c("AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", 
"AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", 
"AUS", "AUS", "AUS", "AUS"), TIME_FORMAT = c("P1Y", "P1Y", "P1Y", 
"P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", 
"P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y"), obsTime = c("1999", 
"2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", 
"2008", "2009", "2010", "2011", "2012", "2013", "2014", "1999", 
"2000", "2001", "2004"), obsValue = c(434, 398, 225, 345, 544, 
726, 1099, 1607, 1377, 1018, 946, 873, 1131, 903, 1230, 2939, 
0, 0, 2, 24), OBS_STATUS = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), migrants = c(434, 398, 225, 345, 
544, 726, 1099, 1607, 1377, 1018, 946, 873, 1131, 903, 1230, 
2939, 0, 0, 2, 24)), .Names = c("CO2", "VAR", "GEN", "COU", "TIME_FORMAT", 
"obsTime", "obsValue", "OBS_STATUS", "migrants"), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

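Here is my entire code, including my own two attempts at solving the problem. Neither works: they either just copy the "obsValue" column or give me a column of TRUE and FALSE. Note that R will need quite some time to load the dataset.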

library(OECD)
library(plyr)
library(dplyr)

search_dataset("migration")
MIG<- get_dataset("MIG")
get_data_structure("MIG")

MIG$migrants <- if(MIG$VAR == "B11")MIG$migrants<-MIG$obsValue else MIG$migrants<-NA


MIG_long <- mutate(MIG,migrants=VAR=="B11")
if(MIG_long$migrants==T)MIG_long$migrants<-MIG_long$obsValue else MIG_long$migrants<-NA

I hope this question is not too basic for you and that you can make sense of my work. If you have any questions, please ask.

Best wishes, Marcel

1 answer:

Answer 0 (score: 2)

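You can use tidyr to spread the VAR and obsValue columns. If you do want one row per year, as @atiretoo has pointed out, you just need to remove the migrants column to get unique values for each year.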

library(tidyr)
library(dplyr)

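MIG %>% 
  select(-migrants) %>%   # drop migrants, which duplicates obsValue in the sample data
  spread(VAR, obsValue)   # one column per value of VAR, filled from obsValue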

     CO2 obsTime   B11   B12
   (chr)   (chr) (dbl) (dbl)
1    AFG    1999   434     0
2    AFG    2000   398     0
3    AFG    2001   225     2
4    AFG    2002   345    NA
5    AFG    2003   544    NA
6    AFG    2004   726    24
7    AFG    2005  1099    NA
8    AFG    2006  1607    NA
9    AFG    2007  1377    NA
10   AFG    2008  1018    NA
11   AFG    2009   946    NA
12   AFG    2010   873    NA
13   AFG    2011  1131    NA
14   AFG    2012   903    NA
15   AFG    2013  1230    NA
16   AFG    2014  2939    NA
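
As a supplementary sketch, not part of the original answer: the `if` attempts in the question fail because `if` expects a single TRUE/FALSE rather than one value per row, whereas `if_else()` from dplyr tests each row separately. In tidyr 1.0.0 and later, `pivot_wider()` does the same reshaping as `spread()`. The small table `mig_small` and the column name `B11_only` are made up for illustration; only the column names taken from `MIG` come from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of MIG: same column names, far fewer rows
mig_small <- tibble(
  CO2      = "AFG",
  VAR      = c("B11", "B11", "B11", "B12", "B12"),
  COU      = "AUS",
  obsTime  = c("1999", "2000", "2002", "1999", "2000"),
  obsValue = c(434, 398, 345, 0, 0)
)

# Vectorised version of the attempts in the question:
# if_else() checks the condition for every row, so B11 rows keep
# their obsValue and all other rows get NA
mig_b11 <- mig_small %>%
  mutate(B11_only = if_else(VAR == "B11", obsValue, NA_real_))

# Reshape all variables at once: one column per distinct value of VAR,
# filled from obsValue; the remaining columns identify each row
mig_wide <- mig_small %>%
  pivot_wider(names_from = VAR, values_from = obsValue)

print(mig_wide)
```

Combinations that have no observation for a variable, such as B12 in 2002 here, come out as `NA`, just as in the `spread()` output above.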