How to manipulate this data frame using tidyverse

Time: 2019-06-18 09:44:07

Tags: r tidyverse tidyr

My data looks like this:


   Year Categories January   February March April      May      June      July    August September   October   November     December
1  1990          A  4564.0   465465.0    12   468   4884.0  12788.00   4218.00 -58445.86 -90643.00 -122840.1 -155037.29 -187234.4286
2  1990          B  6487.0   421214.0   878  2112 421283.0  56456.00  54654.00    515.00    212.00     515.0     212.00     515.0000
3  1990          C 42862.0      512.0   484    48    515.0    212.00    515.00 137858.33     48.00  137858.3      48.00     465.0000
4  1990          D    15.0  -169222.7    90   456 137858.3     48.00    465.00 135673.83    778.00  135673.8     778.00      12.0000
5  1990          E 19164.0  -401699.2  -304   246 135673.8    778.00     12.00 133489.33     57.00  133489.3      57.00     478.0000
6  1991          A 21436.8  -634175.7  -698    36 133489.3     57.00    478.00 131304.83      3.00  131304.8       3.00     331.3333
7  1991          B 23709.6  -866652.2 -1092  -174 131304.8      3.00  -8210.60 129120.33  30425.33  129120.3  -11463.57     337.8333
11 1992          A 32800.8 -1796558.2 -2668 -1014 122566.8 -27597.89 -29087.86 292051.00  82253.33  331147.5  -12728.17     363.8333
12 1992          B 35073.6 -2029034.7 -3062 -1224 120382.3 -32976.00 -34307.17 321333.47  95210.33  367329.4  -14420.56     370.3333
13 1992          C 37346.4 -2261511.2 -3456 -1434 118197.8 -38354.11 -39526.49 350615.94 108167.33  403511.2  -16112.96     376.8333

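I want to manipulate this data frame using tidyverse as follows:

First, not every year has the same number of categories. Every category should still be shown, even for the years in which it does not occur. As you can see, 1990 has 5 categories while 1991 has only 2.

The months should be laid out side by side instead of row by row, in this order: Jan 90, Feb 90, ..., Dec 90, Jan 91, Feb 91, ..., Dec 91, Jan 92, ..., Dec 92 (these become the column names).

That is how I would like to see it in the columns. Year should be dropped, and the unique categories should appear in the leftmost column (under Categories). If a category has no data for a given month of a given year, that cell can be "0".

I want to do this with tidyverse in R, but I could not work out how to write the code. Any help would be appreciated.

This is the expected version of the data; as I said, the months should be placed side by side: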

  Categories Jan.90    Feb.90 Mar.90 Apr.90   May.90 June.90 July.90    Aug.90 Sep.90    Oct.90    Nov.90    Dec.90  Jan.91    Feb.91 Mar.91
1          A   4564  465465.0     12    468   4884.0   12788    4218 -58445.86 -90643 -122840.1 -155037.3 -187234.4 21436.8 -634175.7   -698
2          B   6487  421214.0    878   2112 421283.0   56456   54654    515.00    212     515.0     212.0     515.0 23709.6 -866652.2  -1092
3          C  42862     512.0    484     48    515.0     212     515 137858.33     48  137858.3      48.0     465.0     0.0       0.0      0
4          D     15 -169222.7     90    456 137858.3      48     465 135673.83    778  135673.8     778.0      12.0     0.0       0.0      0
5          E  19164 -401699.2   -304    246 135673.8     778      12 133489.33     57  133489.3      57.0     478.0     0.0       0.0      0
  Apr.91   May.91 June.91 July.91   Aug.91   Sep.91   Oct.91    Nov.91   Dec.91  Jan.92   Feb.92 Mar.92 Apr.92   May.92   June.92   July.92
1     36 133489.3      57   478.0 131304.8     3.00 131304.8      3.00 331.3333 32800.8 -1796558  -2668  -1014 122566.8 -27597.89 -29087.86
2   -174 131304.8       3 -8210.6 129120.3 30425.33 129120.3 -11463.57 337.8333 35073.6 -2029035  -3062  -1224 120382.3 -32976.00 -34307.17
3      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000 37346.4 -2261511  -3456  -1434 118197.8 -38354.11 -39526.49
4      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000     0.0        0      0      0      0.0      0.00      0.00
5      0      0.0       0     0.0      0.0     0.00      0.0      0.00   0.0000     0.0        0      0      0      0.0      0.00      0.00
    Aug.92    Sep.92   Oct.92    Nov.92   Dec.92
1 292051.0  82253.33 331147.5 -12728.17 363.8333
2 321333.5  95210.33 367329.4 -14420.56 370.3333
3 350615.9 108167.33 403511.2 -16112.96 376.8333
4      0.0      0.00      0.0      0.00   0.0000
5      0.0      0.00      0.0      0.00   0.0000

2 answers:

Answer 0 (score: 4)

You can first gather the data into long format, group_by Year and complete the missing Categories. We then use unite to combine the month and the year, and finally spread it back into wide format, filling the missing values with 0.

library(tidyverse)

df %>%
   gather(key, value, -Year, -Categories) %>%
   group_by(Year) %>%
   complete(Categories) %>%
   unite(MonthYear, key, Year) %>%
   spread(MonthYear, value, fill = 0)

#  Categories April_1990 April_1991 April_1992 August_1990 ....
#  <fct>           <dbl>      <dbl>      <dbl>       <dbl> ....
#1 A                 468         36      -1014     -58446. ....
#2 B                2112       -174      -1224        515  ....
#3 C                  48          0      -1434     137858. ....
#4 D                 456          0          0     135674. ....
#5 E                 246          0          0     133489. ....

If we want to keep the original order of the columns, one easy way is to convert MonthYear into a factor whose levels follow the order of appearance:

df %>%
   gather(key, value, -Year, -Categories) %>%
   group_by(Year) %>%
   complete(Categories) %>%
   unite(MonthYear, key, Year) %>%
   mutate(MonthYear = factor(MonthYear, levels = unique(MonthYear))) %>%
   spread(MonthYear, value, fill = 0)

#  Categories January_1990 February_1990 March_1990 April_1990 ....
#  <chr>             <dbl>         <dbl>      <dbl>      <dbl> ....
#1 A                  4564       465465          12        468 ....
#2 B                  6487       421214         878       2112 ....
#3 C                 42862          512         484         48 ....
#4 D                    15      -169223.         90        456 ....
#5 E                 19164      -401699.       -304        246 ....
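In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The following is a minimal sketch of the same reshaping with those functions, not part of the original answer. The tibble df_small is a hypothetical two-month subset of the question's data, and the column name Month is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-month subset of the question's data (illustration only)
df_small <- tibble(
  Year       = c(1990L, 1990L, 1990L, 1991L),
  Categories = c("A", "B", "C", "A"),
  January    = c(4564, 6487, 42862, 21436.8),
  February   = c(465465, 421214, 512, -634175.7)
)

wide <- df_small %>%
  # Long format: one row per Year / Categories / Month
  pivot_longer(cols = -c(Year, Categories),
               names_to = "Month", values_to = "value") %>%
  # Sort by Year so the new columns appear year by year
  arrange(Year) %>%
  # Wide format: one column per Month_Year, missing cells filled with 0
  pivot_wider(
    id_cols     = Categories,
    names_from  = c(Month, Year),
    values_from = value,
    values_fill = list(value = 0)
  )

print(wide)
```

As far as I know, pivot_wider orders the new columns by first appearance by default, so sorting the long data by Year should be enough to keep the months in calendar order within each year, without converting the names to a factor.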

Edit

As the OP mentioned in the comments, the real data triggers a duplicate identifiers error. To avoid it, we can create a unique index for each MonthYear before spreading:

df %>%
   gather(key, value, -Year, -Categories) %>%
   group_by(Year) %>%
   complete(Categories) %>%
   unite(MonthYear, key, Year) %>%
   mutate(MonthYear = factor(MonthYear, levels = unique(MonthYear))) %>%
   group_by(MonthYear) %>%
   mutate(i = row_number()) %>%
   spread(MonthYear, value) %>%
   ungroup() %>%
   select(-i)
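To show where the duplicate identifiers error comes from, here is a small sketch on made-up data. The tibble dup is hypothetical, and the index here is built per Categories/MonthYear pair, which is a variation on the grouping used in the answer rather than the answer's exact code.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: rows 1 and 2 share the same Categories / MonthYear pair
dup <- tibble(
  Categories = c("A", "A", "B"),
  MonthYear  = c("January_1990", "January_1990", "January_1990"),
  value      = c(1, 2, 3)
)

# spread() cannot decide which of the two "A" values goes into the cell,
# so it stops with an error; record whether that happens
spread_failed <- tryCatch({
  spread(dup, MonthYear, value)
  FALSE
}, error = function(e) TRUE)

# Adding a running index makes every row uniquely identified
fixed <- dup %>%
  group_by(Categories, MonthYear) %>%
  mutate(i = row_number()) %>%
  ungroup() %>%
  spread(MonthYear, value) %>%
  select(-i)

print(fixed)
```

The duplicated rows are kept as separate rows in the output. If a single row per category is needed instead, the duplicates would have to be aggregated before spreading.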


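Answer 1 (score: 0)

How about gathering, then pasting the year and month together, then spreading? I use an absurd workaround to keep the columns in the correct order. Try: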

library(dplyr)
library(tidyr)

df %>% 
  gather(k, v, -Year, -Categories) %>% 
  arrange(Categories, Year) %>% 
  group_by(Categories) %>% 
  mutate(n = row_number(),
         col = paste0("n", 1000+n, substr(k, 1, 3), ".", substr(Year, 3, 4))) %>% 
  ungroup() %>% 
  arrange(col) %>% 
  select(-Year, -k, -n) %>% 
  spread(col, v, fill = 0) %>% 
  rename_at(vars(-Categories), ~substr(., 6, nchar(.)))

Result

# A tibble: 5 x 49
  Categories Jan.90  Feb.90 Mar.90 Apr.90 May.90 Jun.90 Jul.90  Aug.90 Sep.90  Oct.90  Nov.90  Dec.90 Jan.91 Jan.92  Feb.91  Feb.92 Mar.91 Mar.92 Apr.91 Apr.92 May.91
  <chr>       <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 A            4564  4.65e5     12    468 4.88e3  12788   4218 -58446. -90643 -1.23e5 -1.55e5 -1.87e5 21437.     0  -6.34e5  0.       -698      0     36      0 1.33e5
2 B            6487  4.21e5    878   2112 4.21e5  56456  54654    515     212  5.15e2  2.12e2  5.15e2 23710.     0  -8.67e5  0.      -1092      0   -174      0 1.31e5
3 C           42862  5.12e2    484     48 5.15e2    212    515 137858.     48  1.38e5  4.80e1  4.65e2     0  37346.  0.     -2.26e6      0  -3456      0  -1434 0.    
4 D              15 -1.69e5     90    456 1.38e5     48    465 135674.    778  1.36e5  7.78e2  1.20e1     0      0   0.      0.          0      0      0      0 0.    
5 E           19164 -4.02e5   -304    246 1.36e5    778     12 133489.     57  1.33e5  5.70e1  4.78e2     0      0   0.      0.          0      0      0      0 0.    
# … with 27 more variables: May.92 <dbl>, Jun.91 <dbl>, Jun.92 <dbl>, Jul.91 <dbl>, Jul.92 <dbl>, Aug.91 <dbl>, Aug.92 <dbl>, Sep.91 <dbl>, Sep.92 <dbl>, Oct.91 <dbl>,
#   Oct.92 <dbl>, Nov.91 <dbl>, Nov.92 <dbl>, Dec.91 <dbl>, Dec.92 <dbl>, Jan.92 <dbl>, Feb.92 <dbl>, Mar.92 <dbl>, Apr.92 <dbl>, May.92 <dbl>, Jun.92 <dbl>, Jul.92 <dbl>,
#   Aug.92 <dbl>, Sep.92 <dbl>, Oct.92 <dbl>, Nov.92 <dbl>, Dec.92 <dbl>
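The printed result above puts Jan.92 before Feb.91, so the row-number prefix does not fully preserve calendar order when a category skips a year. One possible alternative, sketched here on a hypothetical subset (df_small, month_no and res are illustrative names), is to sort by year and month number and then fix the column order with a factor, as Answer 0 does.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset: category C has no rows for 1991
df_small <- tibble(
  Year       = c(1990L, 1990L, 1991L, 1992L),
  Categories = c("A", "C", "A", "C"),
  January    = c(4564, 42862, 21436.8, 37346.4),
  February   = c(465465, 512, -634175.7, -2261511.2)
)

res <- df_small %>%
  gather(k, v, -Year, -Categories) %>%
  mutate(
    # Month number from base R's month.name, used only for sorting
    month_no = match(k, month.name),
    # Label such as "Jan.90": first three letters of the month, last two digits of the year
    col = paste0(substr(k, 1, 3), ".", substr(Year, 3, 4))
  ) %>%
  arrange(Year, month_no) %>%
  # Factor levels in calendar order determine the column order after spread()
  mutate(col = factor(col, levels = unique(col))) %>%
  select(Categories, col, v) %>%
  spread(col, v, fill = 0)

print(res)
```

Because the ordering comes from Year and the month number rather than from a per-category row count, a category that is missing in some year should not shift the columns.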

Data

df <- structure(list(
  Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1992L, 1992L, 1992L),
  Categories = c("A", "B", "C", "D", "E", "A", "B", "A", "B", "C"),
  January = c(4564, 6487, 42862, 15, 19164, 21436.8, 23709.6, 32800.8, 35073.6, 37346.4),
  February = c(465465, 421214, 512, -169222.7, -401699.2, -634175.7, -866652.2, -1796558.2, -2029034.7, -2261511.2),
  March = c(12L, 878L, 484L, 90L, -304L, -698L, -1092L, -2668L, -3062L, -3456L),
  April = c(468L, 2112L, 48L, 456L, 246L, 36L, -174L, -1014L, -1224L, -1434L),
  May = c(4884, 421283, 515, 137858.3, 135673.8, 133489.3, 131304.8, 122566.8, 120382.3, 118197.8),
  June = c(12788, 56456, 212, 48, 778, 57, 3, -27597.89, -32976, -38354.11),
  July = c(4218, 54654, 515, 465, 12, 478, -8210.6, -29087.86, -34307.17, -39526.49),
  August = c(-58445.86, 515, 137858.33, 135673.83, 133489.33, 131304.83, 129120.33, 292051, 321333.47, 350615.94),
  September = c(-90643, 212, 48, 778, 57, 3, 30425.33, 82253.33, 95210.33, 108167.33),
  October = c(-122840.1, 515, 137858.3, 135673.8, 133489.3, 131304.8, 129120.3, 331147.5, 367329.4, 403511.2),
  November = c(-155037.29, 212, 48, 778, 57, 3, -11463.57, -12728.17, -14420.56, -16112.96),
  December = c(-187234.4286, 515, 465, 12, 478, 331.3333, 337.8333, 363.8333, 370.3333, 376.8333)
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))