Combining values from data frames in different formats

Time: 2015-05-19 11:34:41

Tags: r

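I have a database with the following columns: "Year", "Month", "T1", ..., "T31".

For example, df_0 is the original format, and I want to convert it into new_df (shown in the second part below):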

id0 <- c ("Year", "Month", "T_day1", "T_day2", "T_day3", "T_day4", "T_day5")
id1 <- c ("2010", "January", 10, 5, 2,3,3)
id2 <- c ("2010", "February", 20,36,5,8,1)
id3 <- c ("2010", "March", 12,23,23,5,25)
df_0 <- rbind (id1, id2, id3)
colnames (df_0)<- id0
head(df_0)

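I want to create a new data frame in which the T1 ... T31 values for each month and year are joined into a single column covering all dates, for example from 2010-01-01 to 2012-01-04: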

date<-seq(as.Date("2010-01-01"), as.Date("2012-01-04"), by="days")

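Alternatively, the values could be added as a new column of a data frame, matched on the values of three other columns (year, month, day):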

year <- lapply(strsplit(as.character(date), "\\-"), "[", 1)
month <- lapply(strsplit(as.character(date), "\\-"), "[", 2)
day <- lapply(strsplit(as.character(date), "\\-"), "[", 3)
df <- cbind (year, month, day)

I would like to end up with a data frame that holds the information in this form:

Year <- rep(2010,15)
Month <- c(rep("January", 5), rep("February",5), rep("March",5))
Day<- rep(c(1,2,3,4,5))
Value <- c(10,5,2,3,3,20,36,5,8,1,12,23,23,5,25)
new_df <- cbind (Year, Month, Day, Value)
head(new_df)

Thanks in advance.

2 answers:

Answer 0: (score: 2)

What you are looking for is reshaping your data. One library you can use for this is reshape2. Here we can use the melt function from the reshape2 library:

library(reshape2)  # provides melt()
melt(data.frame(df_0), id.vars=c("Year", "Month"))

With the data you have, the output will be:

   Year    Month variable value
1  2010  January   T_day1    10
2  2010 February   T_day1    20
3  2010    March   T_day1    12
4  2010  January   T_day2     5
5  2010 February   T_day2    36
6  2010    March   T_day2    23
7  2010  January   T_day3     2
8  2010 February   T_day3     5
9  2010    March   T_day3    23
10 2010  January   T_day4     3
11 2010 February   T_day4     8
12 2010    March   T_day4     5
13 2010  January   T_day5     3
14 2010 February   T_day5     1
15 2010    March   T_day5    25

You can then convert the variable column to day numbers, depending on how that column is formatted.
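As a rough illustration of that last step, the sketch below strips the "T_day" prefix with base R's sub() and converts what remains to an integer. The data frame long_df and its Day column are hypothetical names made up for this example; long_df only mimics the first rows of the melt() output above.

```r
# Hypothetical long-format data mirroring the shape of the melt() output
long_df <- data.frame(
  Year = "2010",
  Month = rep(c("January", "February", "March"), times = 2),
  variable = rep(c("T_day1", "T_day2"), each = 3),
  value = c(10, 20, 12, 5, 36, 23),
  stringsAsFactors = FALSE
)

# Drop the "T_day" prefix and turn the remainder into an integer day.
# as.character() is used so this also works if 'variable' is a factor.
long_df$Day <- as.integer(sub("^T_day", "", as.character(long_df$variable)))

# Sort by calendar month (via the built-in month.name), then by day
long_df <- long_df[order(match(long_df$Month, month.name), long_df$Day), ]
long_df[, c("Year", "Month", "Day", "value")]
```

If the column names follow a different pattern (for example "T_day.01"), the regular expression would need to be adjusted accordingly.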

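Answer 1: (score: 1)

First, I generated my own test data. To keep the demonstration simple I used a shortened date vector: 2010-01-01 to 2010-03-04. In my df_0 I generated a value for every date in the shortened date vector except the last one, and I also included one extra date that is not in the date vector: 2010-03-05. Why I did this will become clear later.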

set.seed(1);
date <- seq(as.Date('2010-01-01'),as.Date('2010-03-04'),by='day');
df_0 <- reshape(setNames(as.data.frame(cbind(do.call(rbind,strsplit(strftime(c(date[-length(date)],as.Date('2010-03-05')),'%Y %B %d'),' ')),round(rnorm(length(date)),3))),c('Year','Month','Day','T_day')),dir='w',idvar=c('Year','Month'),timevar='Day');
attr(df_0,'reshapeWide') <- NULL;
df_0;
##    Year    Month T_day.01 T_day.02 T_day.03 T_day.04 T_day.05 T_day.06 T_day.07 T_day.08 T_day.09 T_day.10 T_day.11 T_day.12 T_day.13 T_day.14 T_day.15 T_day.16 T_day.17 T_day.18 T_day.19 T_day.20 T_day.21 T_day.22 T_day.23 T_day.24 T_day.25 T_day.26 T_day.27 T_day.28 T_day.29 T_day.30 T_day.31
## 1  2010  January   -0.626    0.184   -0.836    1.595     0.33    -0.82    0.487    0.738    0.576   -0.305    1.512     0.39   -0.621   -2.215    1.125   -0.045   -0.016    0.944    0.821    0.594    0.919    0.782    0.075   -1.989     0.62   -0.056   -0.156   -1.471   -0.478    0.418    1.359
## 32 2010 February   -0.103    0.388   -0.054   -1.377   -0.415   -0.394   -0.059      1.1    0.763   -0.165   -0.253    0.697    0.557   -0.689   -0.707    0.365    0.769   -0.112    0.881    0.398   -0.612    0.341   -1.129    1.433     1.98   -0.367   -1.044     0.57     <NA>     <NA>     <NA>
## 60 2010    March   -0.135    2.402   -0.039     <NA>     0.69     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>

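The first half of the solution is a reshape from wide to long format, which takes a single call to reshape(). I also wrapped it in na.omit() so that the unavoidable NA cells in df_0 do not produce rows with NA values: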

df_1 <- na.omit(reshape(df_0,dir='l',idvar=c('Year','Month'),timevar='Day',varying=grep('^T_day\\.',names(df_0)),v.names='Value'));
rownames(df_1) <- NULL;
df_1[order(match(df_1$Month,month.name),df_1$Day),];
##    Year    Month Day  Value
## 1  2010  January   1 -0.626
## 4  2010  January   2  0.184
## 7  2010  January   3 -0.836
## 10 2010  January   4  1.595
## 12 2010  January   5   0.33
## 15 2010  January   6  -0.82
## 17 2010  January   7  0.487
## 19 2010  January   8  0.738
## 21 2010  January   9  0.576
## 23 2010  January  10 -0.305
## 25 2010  January  11  1.512
## 27 2010  January  12   0.39
## 29 2010  January  13 -0.621
## 31 2010  January  14 -2.215
## 33 2010  January  15  1.125
## 35 2010  January  16 -0.045
## 37 2010  January  17 -0.016
## 39 2010  January  18  0.944
## 41 2010  January  19  0.821
## 43 2010  January  20  0.594
## 45 2010  January  21  0.919
## 47 2010  January  22  0.782
## 49 2010  January  23  0.075
## 51 2010  January  24 -1.989
## 53 2010  January  25   0.62
## 55 2010  January  26 -0.056
## 57 2010  January  27 -0.156
## 59 2010  January  28 -1.471
## 61 2010  January  29 -0.478
## 62 2010  January  30  0.418
## 63 2010  January  31  1.359
## 2  2010 February   1 -0.103
## 5  2010 February   2  0.388
## 8  2010 February   3 -0.054
## 11 2010 February   4 -1.377
## 13 2010 February   5 -0.415
## 16 2010 February   6 -0.394
## 18 2010 February   7 -0.059
## 20 2010 February   8    1.1
## 22 2010 February   9  0.763
## 24 2010 February  10 -0.165
## 26 2010 February  11 -0.253
## 28 2010 February  12  0.697
## 30 2010 February  13  0.557
## 32 2010 February  14 -0.689
## 34 2010 February  15 -0.707
## 36 2010 February  16  0.365
## 38 2010 February  17  0.769
## 40 2010 February  18 -0.112
## 42 2010 February  19  0.881
## 44 2010 February  20  0.398
## 46 2010 February  21 -0.612
## 48 2010 February  22  0.341
## 50 2010 February  23 -1.129
## 52 2010 February  24  1.433
## 54 2010 February  25   1.98
## 56 2010 February  26 -0.367
## 58 2010 February  27 -1.044
## 60 2010 February  28   0.57
## 3  2010    March   1 -0.135
## 6  2010    March   2  2.402
## 9  2010    March   3 -0.039
## 14 2010    March   5   0.69
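For readers who want to see the same wide-to-long step on a smaller input, here is a minimal sketch using column names like those in the question (T_day1, T_day2, ...). The objects wide, day_cols and long are illustrative names, and the numbers are the first three days from the question's sample data, stored as numerics rather than strings.

```r
# Small wide-format input: one row per month, one column per day
wide <- data.frame(
  Year = 2010,
  Month = c("January", "February", "March"),
  T_day1 = c(10, 20, 12),
  T_day2 = c(5, 36, 23),
  T_day3 = c(2, 5, 23),
  stringsAsFactors = FALSE
)

# Names of the columns that hold the per-day values
day_cols <- grep("^T_day", names(wide), value = TRUE)

# Stack the day columns into a single Value column;
# 'times' supplies the day number that goes into the new Day column
long <- reshape(
  wide,
  direction = "long",
  idvar = c("Year", "Month"),
  varying = day_cols,
  v.names = "Value",
  timevar = "Day",
  times = seq_along(day_cols)
)
rownames(long) <- NULL

# Sort by calendar month, then by day
long <- long[order(match(long$Month, month.name), long$Day), ]
long
```

Passing times explicitly avoids relying on reshape() guessing the day numbers from the column names, which only works when the names use a separator it recognizes.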

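The second half of the solution merges the long-format data.frame above with the exact dates you want in the resulting data.frame. It takes a fair amount of scaffolding code to convert the date vector into a data.frame with Year, Month and Day columns, but once that is done you can simply call merge() with all.x=T. This keeps every date in the date vector whether or not it is present in df_1, and excludes any date in df_1 that is not in the date vector: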

df_2 <- merge(transform(setNames(as.data.frame(do.call(rbind,strsplit(strftime(date,'%Y %B %d'),' '))),c('Year','Month','Day')),Day=as.integer(Day)),df_1,all.x=T);
df_2[order(match(df_2$Month,month.name),df_2$Day),];
##    Year    Month Day  Value
## 29 2010  January   1 -0.626
## 30 2010  January   2  0.184
## 31 2010  January   3 -0.836
## 32 2010  January   4  1.595
## 33 2010  January   5   0.33
## 34 2010  January   6  -0.82
## 35 2010  January   7  0.487
## 36 2010  January   8  0.738
## 37 2010  January   9  0.576
## 38 2010  January  10 -0.305
## 39 2010  January  11  1.512
## 40 2010  January  12   0.39
## 41 2010  January  13 -0.621
## 42 2010  January  14 -2.215
## 43 2010  January  15  1.125
## 44 2010  January  16 -0.045
## 45 2010  January  17 -0.016
## 46 2010  January  18  0.944
## 47 2010  January  19  0.821
## 48 2010  January  20  0.594
## 49 2010  January  21  0.919
## 50 2010  January  22  0.782
## 51 2010  January  23  0.075
## 52 2010  January  24 -1.989
## 53 2010  January  25   0.62
## 54 2010  January  26 -0.056
## 55 2010  January  27 -0.156
## 56 2010  January  28 -1.471
## 57 2010  January  29 -0.478
## 58 2010  January  30  0.418
## 59 2010  January  31  1.359
## 1  2010 February   1 -0.103
## 2  2010 February   2  0.388
## 3  2010 February   3 -0.054
## 4  2010 February   4 -1.377
## 5  2010 February   5 -0.415
## 6  2010 February   6 -0.394
## 7  2010 February   7 -0.059
## 8  2010 February   8    1.1
## 9  2010 February   9  0.763
## 10 2010 February  10 -0.165
## 11 2010 February  11 -0.253
## 12 2010 February  12  0.697
## 13 2010 February  13  0.557
## 14 2010 February  14 -0.689
## 15 2010 February  15 -0.707
## 16 2010 February  16  0.365
## 17 2010 February  17  0.769
## 18 2010 February  18 -0.112
## 19 2010 February  19  0.881
## 20 2010 February  20  0.398
## 21 2010 February  21 -0.612
## 22 2010 February  22  0.341
## 23 2010 February  23 -1.129
## 24 2010 February  24  1.433
## 25 2010 February  25   1.98
## 26 2010 February  26 -0.367
## 27 2010 February  27 -1.044
## 28 2010 February  28   0.57
## 60 2010    March   1 -0.135
## 61 2010    March   2  2.402
## 62 2010    March   3 -0.039
## 63 2010    March   4   <NA>

Note how 2010-03-04 is included even though I did not generate a value for it in df_0, and how 2010-03-05 is excluded even though I did.
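The effect of all.x can be checked in isolation on a few rows. The sketch below is a simplified variant, not the answer's exact code: skeleton, values and result are hypothetical names, and the month name is taken from the built-in month.name constant rather than from strftime's '%B', on the assumption that this sidesteps locale-dependent month names.

```r
# The dates we want in the final result
dates <- seq(as.Date("2010-03-01"), as.Date("2010-03-04"), by = "day")

# Calendar skeleton: one row per wanted date
skeleton <- data.frame(
  Year = as.integer(format(dates, "%Y")),
  Month = month.name[as.integer(format(dates, "%m"))],
  Day = as.integer(format(dates, "%d")),
  stringsAsFactors = FALSE
)

# Long-format values: nothing for March 4, but an extra row for March 5
values <- data.frame(
  Year = 2010L,
  Month = "March",
  Day = c(1L, 2L, 3L, 5L),
  Value = c(-0.135, 2.402, -0.039, 0.69),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every skeleton row (a left join);
# rows of 'values' with no matching skeleton date are dropped
result <- merge(skeleton, values, by = c("Year", "Month", "Day"), all.x = TRUE)
result <- result[order(result$Day), ]
result
```

March 4 appears with an NA value because it exists only in skeleton, while March 5 is absent because it exists only in values.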