Manipulate data to have a total count as well as individual counts per year

Time: 2019-04-01 18:47:15

Tags: r

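I have this data, in which each row is one year and holds details of the Best Picture, Best Leading Actor, and Best Leading Actress acceptance speeches.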

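I need to reshape the data set so that there are 3 rows per year, with a new column, type, identifying which kind of speech each row corresponds to (see the desired output below the data). I also need to add thanksM and thanksW together.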

## wcnt:    number of words in the Best Picture acceptance speech
## year:    movie release year (broadcast occurs in year+1)
## budget:  total unadjusted budget in US dollars
## inflate: Inflation rate with respect to Fall 2018
## thanksP: number of "thanks" in the Best Picture speech
## man:     number of words in the Best Leading Actor speech
## woman:   number of words in the Best Leading Actress speech
## thanksM: number of "thanks" in the Best Leading Actor speech
## thanksW: number of "thanks" in the Best Leading Actress speech
oscars<- read.table(header=T, sep=",", text="
 wcnt, year,    budget, inflate, thanksP,  man,   woman,  thanksM,  thanksW,  time
  212, 1942  , 1344000 ,  16.06,  3,       101,     452   ,     1    ,    2 ,  108
  119, 1946   ,2100000  , 13.85,  1,        56,     218   ,     2   ,     1 ,  101
  176, 1947   ,2000000   ,11.73,  5,        96,     220   ,     1   ,     1 ,  172
   50, 1949  ,       0   ,10.51,  4,        29 ,     31   ,     3   ,     1  , 118
   34, 1950  , 1400000,   10.73,  4 ,      208  ,    46   ,     3   ,     1   ,110
   31, 1951  , 2723903,    9.93,  3 ,       73   ,   43   ,     1   ,     1   ,138
  156, 1952  , 4000000,    9.51,  3 ,      159    , 100   ,     0   ,     4 ,  113
   97, 1953  , 1650000,    9.48,  3 ,        4,      33   ,     2   ,     1 ,   93
   46, 1954  ,  910000,    9.37,  1  ,      64,      33    ,    1    ,    2 ,  118
   70, 1955  ,  343000,    9.44,  1  ,      61,      71   ,     4   ,    1  , 108
   35, 1956  , 6000000,    9.41,  2 ,       22 ,    132   ,     1   ,     3 ,   90
   91, 1957  , 3000000,    9.14,  1  ,      79,      41   ,     2   ,     3 ,  188
   20, 1958  , 3319355,    8.82,  1   ,     36 ,     39   ,     2    ,    4 ,  161
   81, 1959  ,15900000,    8.69,  1    ,   131,      78   ,     3  ,      4 ,  115
   70, 1960  , 3000000 ,   8.61,  1     ,   76 ,     30   ,     3 ,       2 ,  125
  125, 1961  , 6000000,    8.46,  2      , 104,      71   ,     1 ,       0 ,  130
   90, 1962  ,15000000 ,   8.40,  2 ,       74  ,    28    ,    5  ,      1  , 150
   64, 1963  , 1000000,    8.29,  1 ,       52 ,     55   ,     1 ,       3 ,  128
  159, 1964  ,17000000,    8.16,  6  ,      81  ,    97   ,     2 ,       6  , 170
   69, 1965  , 8200000,    8.08 , 4   ,     46   ,   24   ,     4 ,       2 ,  174
    4, 1966  , 2000000,    7.93 , 1    ,    62   ,   36   ,     1 ,       2 ,  151
   99, 1967  , 2000000  ,  7.66 , 3     ,  120  ,    44   ,    11 ,       2 ,  110
   62, 1968  ,10000000  ,  7.39 , 2      ,  44  ,    50   ,     2 ,       1 ,  153
   37, 1969  , 3600000  ,  7.08 , 3       ,127  ,    74   ,     3 ,       2 ,  145
   51, 1970  ,12000000,    6.67  ,5       , 44  ,    41   ,     0 ,       2 ,  172
   66, 1971  , 1800000,    6.34 , 2 ,      143  ,    41   ,     5 ,       4 ,  104
  217, 1972  , 6000000,    6.13 , 2  ,     141  ,    58   ,     1 ,       4 ,  158
  127, 1973  , 5500000  ,  5.92 , 4   ,    240  ,   119   ,     3 ,       5  , 203
   73, 1974  ,13000000  ,  5.41 , 7    ,    59  ,    57   ,     3 ,       4 ,  200
  236, 1975  , 4400000  ,  4.84 , 3     ,  106  ,   131   ,     3 ,       3 ,  192
  125, 1976  ,  960000  ,  4.53 , 5      , 193  ,    82   ,     7 ,       4 ,  218
  216, 1977  , 4000000  ,  4.31 , 3 ,       77  ,    60   ,     1 ,       3 ,  210
   68, 1978  ,15000000  ,  4.03 , 5  ,     317  ,   367   ,     8 ,      11  , 215
  208, 1979  , 8000000   , 3.69 , 1   ,    362  ,   287   ,     4 ,       3   ,192
  162, 1980  , 6000000   , 3.24 , 5    ,   240  ,   137   ,     3  ,      2 ,  193
  188, 1981  , 5500000,    2.90 , 4     ,  590  ,     0   ,     6   ,     0 ,  204
  427, 1982  ,22000000,    2.67 , 1      , 123  ,   231   ,     1   ,     6 ,  195
  192, 1983  , 8000000,    2.58 , 2       ,265  ,   359   ,     3    ,    3 ,  222
  248, 1984  ,18000000  ,  2.47 , 4,       127  ,   144   ,     1     ,   2 ,  190
   48, 1985  ,31000000  ,  2.39 , 3 ,       55  ,   119   ,     2      ,  5 ,  182
  279, 1986  , 6000000  ,  2.30 , 5  ,      97  ,   104   ,     1       , 5 ,  199
  118, 1987  ,23000000  , 2.27  , 4   ,    316  ,   184   ,     8 ,       5 ,  213
  207, 1988  ,25000000   , 2.18 , 5    ,   326  ,   140   ,    11  ,      3 ,  199
  213, 1989  , 7500000   , 2.08 , 9     ,  111  ,   100   ,     1   ,     2 ,  217
  258, 1990  ,22000000   , 1.98 , 3      , 126  ,   189   ,     8    ,    9 ,  215
  236, 1991  ,19000000  ,  1.87 , 7       ,159  ,   278   ,    3      ,  9  , 213
  123, 1992  ,14400000   , 1.83 , 5,       472  ,   185   ,    11      ,  3 ,  210
  282, 1993  ,22000000  ,  1.77,  8 ,      414  ,   264   ,     0 ,       5 ,  198
  423, 1994  ,55000000   , 1.72,  9  ,     228  ,   201   ,     3  ,      3 ,  215
  145, 1995  ,72000000  ,  1.68,  9   ,    184  ,   317   ,     4   ,    12 ,  218
  243, 1996  ,27000000  ,  1.63,  6    ,   226  ,   200   ,     5    ,    1 ,  214
  594, 1997 ,200000000  ,  1.58,  5     ,  193  ,   271   ,     3     ,   6 ,  227
  386, 1998  ,25000000  ,  1.56,  8      , 198  ,   363   ,     7      , 11 ,  242
  321, 1999  ,15000000  ,  1.53,  9       ,260  ,   385   ,     7 ,       9 ,  249
  314, 2000 ,103000000  ,  1.49,  10,      253  ,   396   ,     4  ,      5 ,  203
  378 ,2001,  58000000  ,  1.44 , 11 ,     302  ,   528   ,     4   ,    32 ,  263
  232, 2002,  45000000  ,  1.42,  2    ,   462   ,  234    ,   10     ,   2  , 210
  436, 2003,  94000000   , 1.39 , 4    ,   139  ,   287    ,    3     ,  15 ,  224
  265, 2004,  30000000  ,  1.36 , 6     ,  490  ,   354   ,    15      , 11 ,  194
  193, 2005,   6500000  ,  1.32 , 12     , 208  ,   436   ,     8 ,      11 ,  213
  257, 2006,  90000000  ,  1.27 , 8       ,297  ,   192   ,     8  ,      6 ,  231
  181, 2007,  25000000  ,  1.25 , 6 ,      199  ,    72   ,     6   ,     6 ,  201
  241, 2008,  15000000  ,  1.19 , 5  ,     300  ,   328   ,     4     ,   4 ,  210
  271, 2009,  15000000  ,  1.19 , 8   ,    302  ,   468   ,    12    ,   11 ,  217
  273, 2010,  15000000  ,  1.16 , 9    ,   319  ,   361   ,     2      ,  6 ,  195
  263, 2011,  15000000  ,  1.14 , 8     ,  122  ,   270   ,     7 ,      11 ,  194
  634, 2012,  44500000  ,  1.11 , 22     , 254  ,   118   ,     2 ,       7 ,  215
  380, 2013,  20000000  ,  1.09 , 14      ,549  ,   513   ,    12 ,      11 ,  214
  431, 2014,  18000000  ,  1.08 , 10      ,195  ,   324   ,     5 ,       8 ,  223
  148, 2015,  20000000  ,  1.08  ,4,       402  ,   178   ,    10  ,     10 ,  217
  283, 2016,   1500000  ,  1.06 , 9 ,      218  ,   294   ,     4  ,     9  , 229
  213, 2017,  19400000  ,  1.04 , 4  ,     293  ,   264   ,     8 ,       3 ,  233")
Desired output:

year words thanks type
1942 212   3      BestPicture
1942 101   1      Actor
1942 452   2      Actress
1946 119   1      BestPicture
1946 56    2      Actor
1946 218   1      Actress
1947 176   5      BestPicture
1947 96    1      Actor
1947 220   1      Actress


2 answers:

Answer 0 (score: 0)

We can use melt from data.table:

library(data.table)

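# as.data.table() copies; setDT() would convert oscars by reference, so the
# setnames() call would also rename its columns and break the dplyr code below
DT <- as.data.table(oscars)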
setnames(DT, c("wcnt", "man", "woman"), c("wcntP", "wcntM", "wcntW"))

output <- melt(DT[, .SD, .SDcols = names(DT) %like% "year|^thanks|^wcnt"], 
               id.vars = "year", measure.vars = patterns("^thanks", "^wcnt"), 
               variable.name = "type", value.name = c("thanks", "words"))[order(year)]
levels(output$type) = c("BestPicture", "Actor", "Actress")

Output:

     year        type thanks words
  1: 1942 BestPicture      3   212
  2: 1942       Actor      1   101
  3: 1942     Actress      2   452
  4: 1946 BestPicture      1   119
  5: 1946       Actor      2    56
 ---                              
212: 2016       Actor      4   218
213: 2016     Actress      9   294
214: 2017 BestPicture      4   213
215: 2017       Actor      8   293
216: 2017     Actress      3   264

We can also use gather from tidyr together with dplyr, but this seems to be less efficient than data.table::melt:

library(dplyr)
library(tidyr)

oscars %>%
  select(year, starts_with("thanks"), wcnt, man, woman) %>%
  gather(type, thanks, starts_with("thanks")) %>%
  gather(type2, words, wcnt, man, woman) %>%
  arrange(year) %>%
  filter((type == "thanksP" & type2 == "wcnt") | 
           (type == "thanksM" & type2 == "man") | 
           (type == "thanksW" & type2 == "woman")) %>%
  mutate(type = case_when(type == "thanksP" ~ "BestPicture",
                          type == "thanksM" ~ "Actor",
                          TRUE ~ "Actress")) %>%
  select(year, words, thanks, type)

Output:

    year words thanks        type
1   1942   212      3 BestPicture
2   1942   101      1       Actor
3   1942   452      2     Actress
4   1946   119      1 BestPicture
5   1946    56      2       Actor
6   1946   218      1     Actress
7   1947   176      5 BestPicture
8   1947    96      1       Actor
9   1947   220      1     Actress
10  1949    50      4 BestPicture
11  1949    29      3       Actor
12  1949    31      1     Actress
13  1950    34      4 BestPicture
14  1950   208      3       Actor
15  1950    46      1     Actress
16  1951    31      3 BestPicture
17  1951    73      1       Actor
18  1951    43      1     Actress
19  1952   156      3 BestPicture
20  1952   159      0       Actor
...
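
As a side note, gather has been superseded in newer tidyr releases by pivot_longer, which can do the double reshape in one step. Below is a minimal sketch, assuming tidyr 1.0.0 or later. oscars_small is a hypothetical three-row subset of the question's data, and the wordsP/wordsM/wordsW names are introduced here only so that the word-count columns share the P/M/W suffix of the thanks columns:

```r
library(dplyr)
library(tidyr)

# Hypothetical subset: first three rows of the question's data,
# restricted to the columns needed for the reshape
oscars_small <- data.frame(
  year    = c(1942, 1946, 1947),
  wcnt    = c(212, 119, 176),
  man     = c(101, 56, 96),
  woman   = c(452, 218, 220),
  thanksP = c(3, 1, 5),
  thanksM = c(1, 2, 1),
  thanksW = c(2, 1, 1)
)

# Lookup from column suffix to the label wanted in the output
labels <- c(P = "BestPicture", M = "Actor", W = "Actress")

long <- oscars_small %>%
  # Give the word-count columns the same suffix scheme as thanksP/M/W
  rename(wordsP = wcnt, wordsM = man, wordsW = woman) %>%
  # ".value" means: the first captured group (words or thanks) names the
  # output column, the second captured group goes into `type`
  pivot_longer(-year,
               names_to = c(".value", "type"),
               names_pattern = "^(words|thanks)([PMW])$") %>%
  mutate(type = unname(labels[type])) %>%
  select(year, words, thanks, type)

print(long)
```

This should give three rows per year in BestPicture, Actor, Actress order, without the filter step that the double gather needs.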

Answer 1 (score: 0)

Another tidyverse possibility is:

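library(dplyr)
library(tidyr)

# Reshape the word counts and the thanks counts separately, sort both halves
# the same way (by year, then type), and glue them together column-wise.
# This relies on the two halves ending up in exactly the same row order.
bind_cols(oscars %>%
 select(-budget, -inflate, -time, -contains("thanks")) %>%
 gather(type, words, -c(year)) %>%
 mutate(type = ifelse(type == "wcnt", "BestPicture",
                      ifelse(type == "man", "Actor", "Actress"))) %>%
 arrange(year, type), oscars %>%
 select(-budget, -inflate, -time, -wcnt, -man, -woman) %>%
 gather(temp, thanks, -c(year)) %>%
 mutate(temp = ifelse(temp == "thanksP", "BestPicture",
                      ifelse(temp == "thanksM", "Actor", "Actress"))) %>%
 arrange(year, temp) %>%
 select(-year, -temp))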

     year        type words thanks
1   1942       Actor   101      1
2   1942     Actress   452      2
3   1942 BestPicture   212      3
4   1946       Actor    56      2
5   1946     Actress   218      1
6   1946 BestPicture   119      1
7   1947       Actor    96      1
8   1947     Actress   220      1
9   1947 BestPicture   176      5
10  1949       Actor    29      3
11  1949     Actress    31      1
12  1949 BestPicture    50      4
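
For comparison, the same reshape can be sketched in base R without any packages by building one block per speech type and stacking them. oscars_small is again a hypothetical three-row subset of the question's data, and one_type is a helper name made up for this sketch. The last line shows one possible reading of the "add thanksM and thanksW together" part of the question, which neither answer covers; whether that sum is what was meant is an assumption:

```r
# Hypothetical subset: first three rows of the question's data
oscars_small <- data.frame(
  year    = c(1942, 1946, 1947),
  wcnt    = c(212, 119, 176),
  man     = c(101, 56, 96),
  woman   = c(452, 218, 220),
  thanksP = c(3, 1, 5),
  thanksM = c(1, 2, 1),
  thanksW = c(2, 1, 1)
)

# Helper (made up for this sketch): one block of rows for a single speech type
one_type <- function(words, thanks, type) {
  data.frame(year = oscars_small$year, words = words,
             thanks = thanks, type = type)
}

# Stack the three blocks; all of them share the same column names
long <- rbind(
  one_type(oscars_small$wcnt,  oscars_small$thanksP, "BestPicture"),
  one_type(oscars_small$man,   oscars_small$thanksM, "Actor"),
  one_type(oscars_small$woman, oscars_small$thanksW, "Actress")
)

# order() leaves ties in their original order, so within each year the rows
# stay in BestPicture, Actor, Actress order
long <- long[order(long$year), ]
rownames(long) <- NULL

# Assumed interpretation: combined "thanks" of both leading-role speeches per year
thanks_actors <- oscars_small$thanksM + oscars_small$thanksW

print(long)
print(thanks_actors)
```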