Reshaping a data frame from long format to wide format

Date: 2015-04-07 08:51:47

Tags: r

I have the following problem. I have a data set like this:

Province       cases        year      month 
Newyork         10          2000         1
Newyork         20          2000         2
Newyork         30          2000         3
Newyork         40          2000         4
Los Angeles     30          2000         1
Los Angeles     40          2000         2
Los Angeles     50          2000         3
Los Angeles     60          2000         4

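The real data set is very large, covering 20 years and many provinces. How can I reshape my data to get a time series like this, with one row per province and one column per month-year: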

Province      cases.at.1.2000  cases.at.2.2000  cases.at.3.2000  cases.at.4.2000  
Newyork             10               20                30               40
Los Angeles         30               40                50               60

3 answers:

Answer 0 (score: 5)

Just use dcast from the reshape2 package:

library(reshape2)

dcast(df, Province~month+year, value.var='cases')
#    Province 1_2000 2_2000 3_2000 4_2000
#1 LosAngeles     30     40     50     60
#2    Newyork     10     20     30     40

Data:

df=structure(list(Province = c("Newyork", "Newyork", "Newyork", 
"Newyork", "LosAngeles", "LosAngeles", "LosAngeles", "LosAngeles"
), cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L), year = c(2000L, 
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L)), .Names = c("Province", "cases", 
"year", "month"), class = "data.frame", row.names = c(NA, -8L
))

Edit: if some month/province combinations are missing, you can still use dcast. The missing cells are filled with NA:

#     Province cases year month
#1     Newyork    10 2000     1
#2     Newyork    20 2000     2
#3     Newyork    30 2000     3
#4     Newyork    40 2000     4
#5  LosAngeles    30 2000     1
#6  LosAngeles    40 2000     2
#7  LosAngeles    50 2000     3
#8  LosAngeles    60 2000     4
#9     Newyork    99 2000     5
#10   SanDiego    99 2000     5

dcast(df, Province~month+year, value.var='cases')

#    Province 1_2000 2_2000 3_2000 4_2000 5_2000
#1 LosAngeles     30     40     50     60     NA
#2    Newyork     10     20     30     40     99
#3   SanDiego     NA     NA     NA     NA     99
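
The dcast output above names its columns 1_2000, 2_2000, and so on, rather than the cases.at.1.2000 form asked for in the question. Here is a minimal sketch of one way to rename them afterwards. It assumes Province is the only id column and that each key contains exactly one underscore; the name wide and the data.frame() rebuild of the sample data are only for illustration.

```r
library(reshape2)

# Illustrative rebuild of the sample data from this answer.
df <- data.frame(
  Province = rep(c("Newyork", "LosAngeles"), each = 4),
  cases    = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L),
  year     = 2000L,
  month    = rep(1:4, times = 2),
  stringsAsFactors = FALSE
)

wide <- dcast(df, Province ~ month + year, value.var = "cases")

# Every column except the id column is a month_year key such as "1_2000".
value_cols <- names(wide) != "Province"

# Turn "1_2000" into "cases.at.1.2000".
names(wide)[value_cols] <- paste0(
  "cases.at.",
  sub("_", ".", names(wide)[value_cols], fixed = TRUE)
)

print(wide)
```

If you prefer a different separator or prefix, only the paste0() and sub() arguments need to change.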

Answer 1 (score: 2)

We can use reshape from base R after joining the 'month' and 'year' columns into a single key (with paste(...)):

# Add the combined key, then drop columns 3:4 (year, month) before reshaping
reshape(
  transform(df1, yearmonth = paste('at', month, year, sep = "."))[, -(3:4)],
  idvar = 'Province', timevar = 'yearmonth', direction = 'wide')
#      Province cases.at.1.2000 cases.at.2.2000 cases.at.3.2000 cases.at.4.2000
# 1     Newyork              10              20              30              40
# 5 Los Angeles              30              40              50              60

Data:

df1 <- structure(list(Province = c("Newyork", "Newyork", "Newyork", 
"Newyork", "Los Angeles", "Los Angeles", "Los Angeles", "Los Angeles"
), cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L), year = c(2000L, 
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L, 
 2L, 3L, 4L, 1L, 2L, 3L, 4L)), .Names = c("Province", "cases", 
"year", "month"), class = "data.frame", row.names = c(NA, -8L))
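
The one-liner in this answer drops year and month by position, which silently breaks if the column order changes. As a rough sketch of the same base R approach, the steps can be written out one at a time with columns selected by name. The names long and wide are made up for illustration, and the data.frame() call just rebuilds the sample data.

```r
# Illustrative rebuild of the sample data from this answer.
df1 <- data.frame(
  Province = rep(c("Newyork", "Los Angeles"), each = 4),
  cases    = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L),
  year     = 2000L,
  month    = rep(1:4, times = 2),
  stringsAsFactors = FALSE
)

# Build the combined time key, e.g. "at.1.2000".
df1$yearmonth <- paste("at", df1$month, df1$year, sep = ".")

# Keep only the id, time and value columns, selected by name.
long <- df1[, c("Province", "yearmonth", "cases")]

# reshape() names each new column <v.names>.<time value>, e.g. "cases.at.1.2000".
wide <- reshape(long, idvar = "Province", timevar = "yearmonth",
                v.names = "cases", direction = "wide")

# Tidy up the bookkeeping attribute and the leftover row names (1 and 5).
attr(wide, "reshapeWide") <- NULL
rownames(wide) <- NULL

print(wide)
```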

Answer 2 (score: 0)

Based on @Ananda Mahto's suggestion:

library(tidyr); library(dplyr)

df %>% mutate(month = paste0("cases.at.", month)) %>%  
  unite(key, month, year, sep=".") %>% spread(key, cases)

If a month-year combination is missing for some province, use expand to generate every combination first:

df %>% expand(Province, year, month) %>% left_join(df) %>% 
  mutate(month = paste0("cases.at.", month)) %>%  
  unite(key, month, year, sep=".") %>% spread(key, cases)

Data:

df=structure(list(Province = c("Newyork", "Newyork", "Newyork", 
  "Newyork", "LosAngeles", "LosAngeles", "LosAngeles", "LosAngeles", "SanDiego"), 
  cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L, 90L), year = c(2000L, 
  2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L, 
  2L, 3L, 4L, 1L, 2L, 3L, 4L, 4L)), .Names = c("Province", "cases", 
  "year", "month"), class = "data.frame", row.names = c(NA, -9L))
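
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which can combine the two key columns and add the prefix in one call. Here is a minimal sketch, assuming tidyr >= 1.0.0 is installed; the name wide and the data.frame() rebuild of the sample data are only for illustration. Missing month-year combinations, such as SanDiego in months 1 to 3, come out as NA without a separate expand step.

```r
library(tidyr)

# Illustrative rebuild of the sample data from this answer.
df <- data.frame(
  Province = c(rep(c("Newyork", "LosAngeles"), each = 4), "SanDiego"),
  cases    = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L, 90L),
  year     = 2000L,
  month    = c(1:4, 1:4, 4L),
  stringsAsFactors = FALSE
)

# names_from takes both key columns; they are joined with names_sep
# and prefixed with names_prefix, giving e.g. "cases.at.1.2000".
wide <- pivot_wider(
  df,
  id_cols      = Province,
  names_from   = c(month, year),
  names_sep    = ".",
  names_prefix = "cases.at.",
  values_from  = cases
)

print(wide)
```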