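I think I'm facing a (hopefully) small problem, but searching has not turned up anything helpful. I'm having trouble with data I pulled through the OECD package. The dataset I get has all variables stored in a single column. It is in long format, which is fine, but I want each variable in its own column. A sample of the current dataset is given in the dput output further down.
The "VAR" column holds several variables: "B11", "B12", and so on, 11 variables in total. All of them are measured for many countries (column "COU"). What I want to do is add new columns to the dataset, one for each variable currently stored in "VAR", each holding the corresponding values from the "obsValue" column.
That way I could see, for example, the B11 value for Afghanistan in 1999 and another for 2000, with the 1999 B12 value in the same row as the 1999 B11 value, and so on. I hope my goal is clear; if not, please don't hesitate to ask.
Here is the code to reproduce the head of the dataset: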
dput(head(MIG,20))
structure(list(CO2 = c("AFG", "AFG", "AFG", "AFG", "AFG", "AFG",
"AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",
"AFG", "AFG", "AFG", "AFG", "AFG"), VAR = c("B11", "B11", "B11",
"B11", "B11", "B11", "B11", "B11", "B11", "B11", "B11", "B11",
"B11", "B11", "B11", "B11", "B12", "B12", "B12", "B12"), GEN = c("WMN",
"WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN",
"WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN",
"WMN"), COU = c("AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS",
"AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS",
"AUS", "AUS", "AUS", "AUS"), TIME_FORMAT = c("P1Y", "P1Y", "P1Y",
"P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y",
"P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y"), obsTime = c("1999",
"2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "2012", "2013", "2014", "1999",
"2000", "2001", "2004"), obsValue = c(434, 398, 225, 345, 544,
726, 1099, 1607, 1377, 1018, 946, 873, 1131, 903, 1230, 2939,
0, 0, 2, 24), OBS_STATUS = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), migrants = c(434, 398, 225, 345,
544, 726, 1099, 1607, 1377, 1018, 946, 873, 1131, 903, 1230,
2939, 0, 0, 2, 24)), .Names = c("CO2", "VAR", "GEN", "COU", "TIME_FORMAT",
"obsTime", "obsValue", "OBS_STATUS", "migrants"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
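Here is my complete code, including my two own attempts at solving the problem. Neither works: they either just copy the "obsValue" column or give me a column of TRUE/FALSE values. Note that R needs quite a while to download the dataset.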
library(OECD)
library(plyr)
library(dplyr)
search_dataset("migration")
MIG<- get_dataset("MIG")
get_data_structure("MIG")
MIG$migrants <- if(MIG$VAR == "B11")MIG$migrants<-MIG$obsValue else MIG$migrants<-NA
MIG_long <- mutate(MIG,migrants=VAR=="B11")
if(MIG_long$migrants==T)MIG_long$migrants<-MIG_long$obsValue else MIG_long$migrants<-NA
I hope this question isn't too basic for you and that you can make sense of my work. If anything is unclear, please ask.
Best wishes, Marcel
Answer 0 (score: 2)
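You can use spread from tidyr on the VAR and obsValue columns. If you do want one row per year, as @atiretoo has pointed out, you just need to drop the migrants column first so that each year has a unique set of values.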
library(tidyr)
library(dplyr)
MIG %>%
select(-migrants) %>%
spread(VAR, obsValue)
CO2 obsTime B11 B12
(chr) (chr) (dbl) (dbl)
1 AFG 1999 434 0
2 AFG 2000 398 0
3 AFG 2001 225 2
4 AFG 2002 345 NA
5 AFG 2003 544 NA
6 AFG 2004 726 24
7 AFG 2005 1099 NA
8 AFG 2006 1607 NA
9 AFG 2007 1377 NA
10 AFG 2008 1018 NA
11 AFG 2009 946 NA
12 AFG 2010 873 NA
13 AFG 2011 1131 NA
14 AFG 2012 903 NA
15 AFG 2013 1230 NA
16 AFG 2014 2939 NA
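For readers on a newer tidyr (1.0.0 or later), where spread is marked as superseded, the same reshaping can be sketched with pivot_wider. This is only an illustration under stated assumptions: mig_small is a hypothetical five-row stand-in built from a few values in the dput sample, not the full MIG download, and it keeps only some of the original columns.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy subset of MIG (values taken from the dput sample above);
# the real dataset has more columns and rows.
mig_small <- tibble(
  CO2      = "AFG",
  VAR      = c("B11", "B11", "B11", "B12", "B12"),
  COU      = "AUS",
  obsTime  = c("1999", "2000", "2002", "1999", "2000"),
  obsValue = c(434, 398, 345, 0, 0)
)
# Recreate the leftover helper column from the question's attempts
mig_small$migrants <- mig_small$obsValue

mig_wide <- mig_small %>%
  select(-migrants) %>%                                   # drop the duplicate of obsValue
  pivot_wider(names_from = VAR, values_from = obsValue)   # one column per VAR code

print(mig_wide)
```

Dropping migrants matters because every remaining column is treated as a row identifier. If it were kept, B11 and B12 rows for the same year with different values could not collapse into a single row. Years with no B12 observation (2002 in this toy subset) come out as NA.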