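Reshaping a large dataset with multiple columns from wide to long

Time: 2018-01-05 20:17:57

Tags: r reshape

I have a very large dataset that I need to reshape from wide to long.

My dataset looks like: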

  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900

I would like it to look like:

  COMPANY    PRODUCT    DATE  REVENUES  COSTS
COMPANY A  PRODUCT 1  Dec-16     10600   9850
COMPANY A  PRODUCT 1  Feb-10     11050  10400
COMPANY A  PRODUCT 1  Jan-10      6400   8500
COMPANY A  PRODUCT 1  Mar-10      6550   9100
COMPANY A  PRODUCT 2  Dec-16      3800   3250
COMPANY A  PRODUCT 2  Feb-10      3000   2400
COMPANY A  PRODUCT 2  Jan-10      2700   2850
COMPANY A  PRODUCT 2  Mar-10      2800   3100
COMPANY B  PRODUCT 3  Dec-16      3750   4600
COMPANY B  PRODUCT 3  Feb-10      4150   6100
COMPANY B  PRODUCT 3  Jan-10      5900   4200
COMPANY B  PRODUCT 3  Mar-10      5750   2950
COMPANY B  PRODUCT 4  Dec-16       650    500
COMPANY B  PRODUCT 4  Feb-10       600    700
COMPANY B  PRODUCT 4  Jan-10       550    200
COMPANY B  PRODUCT 4  Mar-10         0    100
COMPANY B  PRODUCT 5  Dec-16      2100    450
COMPANY B  PRODUCT 5  Feb-10      3750   1700
COMPANY B  PRODUCT 5  Jan-10      1500   1850
COMPANY B  PRODUCT 5  Mar-10       550   3150
COMPANY C  PRODUCT 6  Dec-16     21250  23900
COMPANY C  PRODUCT 6  Feb-10     17250  26950
COMPANY C  PRODUCT 6  Jan-10     19300  18200
COMPANY C  PRODUCT 6  Mar-10     23600  18200

In Stata, I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string

How can I do this in R?

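7 answers:

Answer 0 (score: 3)

There are several other ways to do this that are more concise than the "tidyverse" options already suggested.

All of the examples below use the sample data from JMT2080AD's answer, with set.seed(1) (for reproducibility).

Option 1: base R reshape

reshape is not always an easy function to use, but it is very powerful once you figure it out. In this case there is no sep, which makes things a bit trickier: you have to be more specific about things like the resulting variable names and the values that should appear as the "times" (by default, they are just sequence numbers).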

times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
# varying is a list with one vector of column names per entry of v.names,
# so each revenues column is paired with the cost column of the same month
reshape(yourData, direction = "long", 
        varying = list(grep("revenues", names(yourData), value = TRUE),
                       grep("cost", names(yourData), value = TRUE)),
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     1164 1168  1
# 2.Jan2010 Company A Product 2 Jan2010     1430 1465  2
# 3.Jan2010 Company B Product 3 Jan2010     1932  533  3
# 4.Jan2010 Company B Product 4 Jan2010     2771 1456  4
# 5.Jan2010 Company B Product 5 Jan2010     1004 2674  5
# ....

This is pretty much what you're looking for, perhaps with a slightly different date format.
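If the date format from the question (for example Jan-10) matters, the labels can be rewritten afterwards. The following is a minimal base R sketch, assuming the labels consist of a month name followed by a four-digit year; the vector dates is made up for illustration.

```r
# Hypothetical date labels, as left over after stripping the "revenues" prefix
dates <- c("Jan2010", "Feb2010", "MARCH2010", "Dec2016")

# Keep the first three letters of the month and the last two digits of the year
short <- sub("^([A-Za-z]{3})[A-Za-z]*[0-9]{2}([0-9]{2})$", "\\1-\\2", dates)
short
# [1] "Jan-10" "Feb-10" "MAR-10" "Dec-16"
```

The same sub() call could be applied to the date column of the reshaped data.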

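Option 2: data.table

If you are after performance, you can look at melt from "data.table", with which you should be able to do something like the following. As with the reshape approach, you need to store the "times" so that you can reintroduce the dates after melting the data.

(Note: I know this is very similar to @Uwe's approach.)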

library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
     value.name = c("revenues", "cost"))[
       , variable := factor(variable, labels = times)][]
#       company   product variable revenues cost
#  1: Company A Product 1  Jan2010     1164 1168
#  2: Company A Product 2  Jan2010     1430 1465
#  3: Company B Product 3  Jan2010     1932  533
#  4: Company B Product 4  Jan2010     2771 1456
#  5: Company B Product 5  Jan2010     1004 2674
# ---                                           
# 20: Company A Product 2  Apr2010     2444 1883
# 21: Company B Product 3  Apr2010     2837 1824
# 22: Company B Product 4  Apr2010     1030 2473
# 23: Company B Product 5  Apr2010     2129  558
# 24: Company C Product 6  Apr2010      814 1693
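To see why the "times" have to be stored, it may help to look at what melt returns before relabelling. This is a small sketch on made-up data (toy is a hypothetical table, not the sample data above): the variable column only holds the position of each column within its group, so the month labels must be mapped back by position.

```r
library(data.table)

# Hypothetical two-month table with the same naming scheme
toy <- data.table(id = 1:2,
                  revenuesJan2010 = c(10, 20), revenuesFeb2010 = c(11, 21),
                  costJan2010 = c(1, 2),       costFeb2010 = c(3, 4))

# Month labels, in the order in which the revenues columns appear
times <- sub("^revenues", "", grep("^revenues", names(toy), value = TRUE))

long <- melt(toy, measure.vars = patterns("^revenues", "^cost"),
             value.name = c("revenues", "cost"))

# 'variable' is only an index (1, 2, ...); look the label up by position
long[, date := times[as.integer(variable)]]
long
```

This relies on the revenues and cost columns being in the same month order, which is also an assumption of the code above.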

Option 3: merged.stack

My "splitstackshape" package has a function called merged.stack that tries to make this particular kind of reshaping easier. With it, you can try:

library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
#       company   product .time_1 revenues cost
#  1: Company A Product 1 Apr2010     1450 2457
#  2: Company A Product 1 Feb2010     2862 1705
#  3: Company A Product 1 Jan2010     1164 1168
#  4: Company A Product 1 Mar2010     2218 2486
#  5: Company A Product 2 Apr2010     2444 1883
#  6: Company A Product 2 Feb2010     2152 1999
#  7: Company A Product 2 Jan2010     1430 1465
#  8: Company A Product 2 Mar2010     1460  770
#  9: Company B Product 3 Apr2010     2837 1824
# 10: Company B Product 3 Feb2010     2073 1734
# ... 

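Someday I'll update the function, which was written before melt in "data.table" could handle semi-wide output formats. I've come up with a partial solution, but then I stopped fiddling with it.

In fact, using the function linked above, the solution is simply: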

ReshapeLong_(yourData, c("revenues", "cost"))

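Option 4: extract from the "tidyverse"

The other solutions using the tidyverse seem to go about things in a rather strange way. A better solution is to use extract to get the data you want into new columns. You first have to gather the data into a very long format, and then spread it back into a wider format.

Here is the approach I would use: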

library(tidyverse)
yourData %>% 
  gather(var, val, -company, -product) %>%
  extract(var, into = c("type", "month", "year"), 
          regex = ("(revenues|cost)(...)(.*)")) %>%
  spread(type, val)
#      company   product month year cost revenues
# 1  Company A Product 1   Apr 2010 2457     1450
# 2  Company A Product 1   Feb 2010 1705     2862
# 3  Company A Product 1   Jan 2010 1168     1164
# 4  Company A Product 1   Mar 2010 2486     2218
# 5  Company A Product 2   Apr 2010 1883     2444
# 6  Company A Product 2   Feb 2010 1999     2152
# ...
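To check what the regular expression captures, the same pattern can be tried in base R on a few column names. This is only an illustrative sketch; the name revenuesMarch2010 is made up to show a caveat, namely that (...) always takes exactly three characters, so a spelled-out month would be split in the wrong place.

```r
vars <- c("revenuesJan2010", "costApr2010", "revenuesMarch2010")

# regexec() returns the full match plus one entry per capture group
parts <- regmatches(vars, regexec("(revenues|cost)(...)(.*)", vars))

# Drop the full match and keep the three groups: type, month, year
groups <- do.call(rbind, lapply(parts, function(p) p[-1]))
groups
# row 1: "revenues" "Jan" "2010"
# row 2: "cost"     "Apr" "2010"
# row 3: "revenues" "Mar" "ch2010"
```

For column names such as REVENUESMARCH2010 from the question, a pattern like "([A-Za-z]+)([0-9]+)" for the month and year parts may be a safer choice.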

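Answer 1 (score: 1)

The tricky thing here is that the dates are packed into the column names. They have to be parsed before you can build the table the way you want. I loop over each column, parse the date and the type of observation from each sub-table's column name, bind the sub-tables together, and then cast on cost/revenue. I'm sure there are more elegant solutions out there.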

library(reshape)

## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                       product = paste("Product", 1:6),
                       revenuesJan2010 = round(runif(6, 500, 3000)),
                       revenuesFeb2010 = round(runif(6, 500, 3000)),
                       revenuesMar2010 = round(runif(6, 500, 3000)),
                       revenuesApr2010 = round(runif(6, 500, 3000)),
                       costJan2010 = round(runif(6, 500, 3000)),
                       costFeb2010 = round(runif(6, 500, 3000)),
                       costMar2010 = round(runif(6, 500, 3000)),
                       costApr2010 = round(runif(6, 500, 3000)))

## a function that parses the date from the column name
columnParse <- function(tab){
    colNm   <- names(tab)[3]
    names(tab)[3] <- "value"
    colDate  <- strsplit(colNm, "revenues|cost")[[1]][2]
    colDate  <- gsub("([A-Za-z]+)", "\\1-", colDate)
    tab$date <- colDate
    tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
    return(tab)
}

## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
                        lapply(3:ncol(yourData),
                               function(x) columnParse(yourData[c(1:2, x)])))

## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")
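The string handling inside columnParse can be followed on a single column name. The sketch below simply repeats its three steps outside the function, using one name from the sample data:

```r
colNm <- "revenuesJan2010"

# Splitting on the prefix leaves an empty string and the date part
colDate <- strsplit(colNm, "revenues|cost")[[1]][2]
colDate
# [1] "Jan2010"

# Append a dash to the run of letters, i.e. the month
colDate <- gsub("([A-Za-z]+)", "\\1-", colDate)
colDate
# [1] "Jan-2010"

# Keep only the prefix as the observation type
colType <- gsub("(revenues|cost).*", "\\1", colNm)
colType
# [1] "revenues"
```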

Answer 2 (score: 1)

Here is another option using tidyverse and stringr:

yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                   product = paste("Product", 1:6),
                   REVENUESJan2010 = round(runif(6, 500, 3000)),
                   REVENUESFeb2010 = round(runif(6, 500, 3000)),
                   REVENUESMar2010 = round(runif(6, 500, 3000)),
                   REVENUESApr2010 = round(runif(6, 500, 3000)),
                   COSTSJan2010 = round(runif(6, 500, 3000)),
                   COSTSFeb2010 = round(runif(6, 500, 3000)),
                   COSTSMar2010 = round(runif(6, 500, 3000)),
                   COSTSApr2010 = round(runif(6, 500, 3000)))

Solution using tidyverse and stringr:

library(tidyverse)
library(stringr)

newData <- yourData %>%
   gather(key = rev.cost.date, value, -company, -product) %>%
   mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
   mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
   select(-rev.cost.date) %>%
   spread(value = value, key = finance.type) %>%
   mutate(date = paste0(str_sub(date, 1, 3), "-", str_sub(date, 4, 8)))

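Answer 3 (score: 1)

As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So, the columns starting with REVENUES and COSTS can be gathered into two separate columns.

In addition, the dates (months) are packed into the column names without a separator. They are extracted from the column names using a regular expression with look-behind and are then used to replace the factor levels of the DATE column.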

library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() 
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT      DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1   JAN2010     6400  8500
 2: COMPANY A PRODUCT 2   JAN2010     2700  2850
 3: COMPANY B PRODUCT 3   JAN2010     5900  4200
 4: COMPANY B PRODUCT 4   JAN2010      550   200
 5: COMPANY B PRODUCT 5   JAN2010     1500  1850
 6: COMPANY C PRODUCT 6   JAN2010    19300 18200
 7: COMPANY A PRODUCT 1   FEB2010    11050 10400
 8: COMPANY A PRODUCT 2   FEB2010     3000  2400
 9: COMPANY B PRODUCT 3   FEB2010     4150  6100
10: COMPANY B PRODUCT 4   FEB2010      600   700
11: COMPANY B PRODUCT 5   FEB2010     3750  1700
12: COMPANY C PRODUCT 6   FEB2010    17250 26950
13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
16: COMPANY B PRODUCT 4 MARCH2010        0   100
17: COMPANY B PRODUCT 5 MARCH2010      550  3150
18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
19: COMPANY A PRODUCT 1   DEC2016    10600  9850
20: COMPANY A PRODUCT 2   DEC2016     3800  3250
21: COMPANY B PRODUCT 3   DEC2016     3750  4600
22: COMPANY B PRODUCT 4   DEC2016      650   500
23: COMPANY B PRODUCT 5   DEC2016     2100   450
24: COMPANY C PRODUCT 6   DEC2016    21250 23900
      COMPANY   PRODUCT      DATE REVENUES COSTS
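The look-behind extraction can also be tried without stringr. As a rough base R equivalent (perl = TRUE is needed for look-behind; the vector nms is a shortened, made-up list of column names):

```r
nms <- c("COMPANY", "PRODUCT", "REVENUESJAN2010", "REVENUESMARCH2010",
         "COSTSJAN2010")

# (?<=REVENUES) only matches right after the prefix, without consuming it
m <- regexpr("(?<=REVENUES)\\w*$", nms, perl = TRUE)

# regmatches() drops the names that did not match at all
month_labels <- regmatches(nms, m)
month_labels
# [1] "JAN2010"   "MARCH2010"
```

Unlike stringr::str_extract, which returns NA for non-matching names, regmatches omits them, so no na.omit() step is needed here.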

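Edit: use an ISO month naming scheme for proper sorting

A naming scheme that uses alphabetic month names followed by the year does not allow the data to be sorted properly by DATE: DEC2016 would come before FEB2010, and FEB2010 before JAN2010. The ISO 8601 naming convention puts the year first, followed by the number of the month.

We can use the following naming scheme: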

months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
  paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT    DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1 2010-01     6400  8500
 2: COMPANY A PRODUCT 2 2010-01     2700  2850
 3: COMPANY B PRODUCT 3 2010-01     5900  4200
 4: COMPANY B PRODUCT 4 2010-01      550   200
 5: COMPANY B PRODUCT 5 2010-01     1500  1850
 6: COMPANY C PRODUCT 6 2010-01    19300 18200
 7: COMPANY A PRODUCT 1 2010-02    11050 10400
 8: COMPANY A PRODUCT 2 2010-02     3000  2400
 9: COMPANY B PRODUCT 3 2010-02     4150  6100
10: COMPANY B PRODUCT 4 2010-02      600   700
11: COMPANY B PRODUCT 5 2010-02     3750  1700
12: COMPANY C PRODUCT 6 2010-02    17250 26950
13: COMPANY A PRODUCT 1 2010-03     6550  9100
14: COMPANY A PRODUCT 2 2010-03     2800  3100
15: COMPANY B PRODUCT 3 2010-03     5750  2950
16: COMPANY B PRODUCT 4 2010-03        0   100
17: COMPANY B PRODUCT 5 2010-03      550  3150
18: COMPANY C PRODUCT 6 2010-03    23600 18200
19: COMPANY A PRODUCT 1 2016-12    10600  9850
20: COMPANY A PRODUCT 2 2016-12     3800  3250
21: COMPANY B PRODUCT 3 2016-12     3750  4600
22: COMPANY B PRODUCT 4 2016-12      650   500
23: COMPANY B PRODUCT 5 2016-12     2100   450
24: COMPANY C PRODUCT 6 2016-12    21250 23900
      COMPANY   PRODUCT    DATE REVENUES COSTS
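The sorting problem is easy to reproduce on the labels alone. The sketch below uses a hypothetical helper, to_iso, which looks the month up in the built-in month.abb constant so that it does not depend on lubridate or on the locale; it is meant as an illustration rather than a replacement for the code above.

```r
labels <- c("JAN2010", "FEB2010", "MARCH2010", "DEC2016")

# Hypothetical helper: first three letters give the month, the digits the year
to_iso <- function(x) {
  mon  <- match(substr(x, 1, 3), toupper(month.abb))
  year <- sub("^[A-Z]+", "", x)
  sprintf("%s-%02d", year, mon)
}

sort(labels)
# [1] "DEC2016"   "FEB2010"   "JAN2010"   "MARCH2010"

sort(to_iso(labels))
# [1] "2010-01" "2010-02" "2010-03" "2016-12"
```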

Data

library(data.table)
wide <- data.table(
readr::read_table(
"  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010     REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010     COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900"
))

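Answer 4 (score: 1)

I think the most explicit way to reshape from wide to long in R (i.e., without having to rename variables) is to use the base R reshape() function and specify the different columns to be "stacked" as a list. See this blog post.

I will use the data from JMT2080AD's answer and set the seed to 789: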

set.seed(789)

### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010", "revenuesFeb2010", "revenuesMar2010", "revenuesApr2010"), # revenues
                     c("costJan2010", "costFeb2010", "costMar2010", "costApr2010"))                 # cost

### reshape wide to long
reshape(yourData,                      #dataframe
        direction="long",              #wide to long
        varying=reshape.vars,          #repeated measures list of indexes for vars to stack/reshape
        timevar="date",                #the repeated measures times
        v.names=c("revenues", "cost")) #the repeated measures names

#       company   product date revenues cost id
# 1.1 Company A Product 1    1     2250 1574  1
# 2.1 Company A Product 2    1      734 1793  2
# 3.1 Company B Product 3    1      530 1282  3
# 4.1 Company B Product 4    1     1979 1741  4
# 5.1 Company B Product 5    1     1730 2558  5
# 6.1 Company C Product 6    1      550 1757  6
# 1.2 Company A Product 1    2     1932 1048  1
#...
# 5.3 Company B Product 5    3      890 1103  5
# 6.3 Company C Product 6    3     2113 2469  6
# 1.4 Company A Product 1    4     2426 2382  1
# 2.4 Company A Product 2    4      778 2995  2
# 3.4 Company B Product 3    4     1359  989  3
# 4.4 Company B Product 4    4     1618  912  4
# 5.4 Company B Product 5    4      895 2109  5
# 6.4 Company C Product 6    4     1258 2803  6

Using the list approach:

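  • You do not have to rename your variables
  • Because the variables to be created are explicitly defined in the list, there are no errors from reshape inferring which variables should be stacked

I find that even with 100+ variables to reshape, where renaming would be cumbersome, creating the list of variables with copy/paste does not take that long.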
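If copy/paste becomes tedious, the list can probably also be built from the column names with grep, as long as the prefixes are known. The following is a sketch on made-up data (toy is hypothetical) that additionally passes times so the date column holds month labels instead of sequence numbers:

```r
# Hypothetical two-month data frame with the same naming scheme
toy <- data.frame(id = 1:2,
                  revenuesJan2010 = c(10, 20), revenuesFeb2010 = c(11, 21),
                  costJan2010 = c(1, 2),       costFeb2010 = c(3, 4))

# One vector of column names per variable to create, in the order of v.names
vars <- list(grep("^revenues", names(toy), value = TRUE),
             grep("^cost", names(toy), value = TRUE))

long <- reshape(toy, direction = "long",
                varying = vars,
                v.names = c("revenues", "cost"),
                timevar = "date",
                times = sub("^revenues", "", vars[[1]]),
                idvar = "id")
long
```

This assumes the revenues and cost columns appear in the same month order; if they do not, the two vectors in the list would need to be sorted first.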

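Answer 5 (score: 0)

As a convert from Stata who was very fond of its reshape command, I find tidyr::gather and tidyr::spread very intuitive. gather is basically reshape long, and spread is reshape wide.

Here is the code to change your data into the desired form: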

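library(tidyr)

new_data <- 
gather(data = your_data_frame,   # replace with the name of your data frame
       key = var_holder,
       value = val_holder,
       -COMPANY,
       -PRODUCT) 

# insert a separator between the variable name and the date
new_data$var_holder <- sub("REVENUES", "REVENUES_", new_data$var_holder)
new_data$var_holder <- sub("COSTS", "COSTS_", new_data$var_holder)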

new_data <- 
    separate(data = new_data,
             col = var_holder,
             into = c("var", "date")) %>%
    spread(key = var,
           value = val_holder)

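Done!

gather works by taking all of the specified variable names (or, as in this case, the unspecified ones; note the "-" sign in front of the two variables) and putting them under one new variable whose name is given by "key = ..." (creating new rows). It then puts the values that fell under those variables under a separate variable whose name is given by "value = ...".

spread goes in the opposite direction. Hope this helps!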
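A tiny round trip may make the description more concrete. This is a sketch on made-up data (toy is hypothetical), assuming a tidyr version that still provides gather and spread:

```r
library(tidyr)

toy <- data.frame(id = 1:2, a = c(1, 2), b = c(3, 4))

# gather: the names of columns a and b go into 'var', their values into 'val'
long <- gather(toy, key = "var", value = "val", -id)
long
# 4 rows: (1, a, 1), (2, a, 2), (1, b, 3), (2, b, 4)

# spread: the entries of 'var' become column names again
wide <- spread(long, key = var, value = val)
wide
# back to the columns id, a, b
```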

Answer 6 (score: 0)

An option using the development version of tidyr (version "0.8.3.9000"):

library(dplyr)
library(tidyr)
library(stringr)
library(zoo)
library(readr)

df1 %>% 
   rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
   pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>% 
   mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
# A tibble: 24 x 5
#   COMPANY   PRODUCT   DATE     REVENUES COSTS
#   <chr>     <chr>     <chr>       <dbl> <dbl>
# 1 COMPANY A PRODUCT 1 Jan-2010     6400  8500
# 2 COMPANY A PRODUCT 1 Feb-2010    11050 10400
# 3 COMPANY A PRODUCT 1 Mar-2010     6550  9100
# 4 COMPANY A PRODUCT 1 Dec-2016    10600  9850
# 5 COMPANY A PRODUCT 2 Jan-2010     2700  2850
# 6 COMPANY A PRODUCT 2 Feb-2010     3000  2400
# 7 COMPANY A PRODUCT 2 Mar-2010     2800  3100
# 8 COMPANY A PRODUCT 2 Dec-2016     3800  3250
# 9 COMPANY B PRODUCT 3 Jan-2010     5900  4200
#10 COMPANY B PRODUCT 3 Feb-2010     4150  6100
# … with 14 more rows
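The renaming step can probably be skipped by giving pivot_longer a regular expression through names_pattern instead of names_sep. This is a sketch assuming tidyr 1.0.0 or later, run on a small made-up subset of the data (wide_toy is a hypothetical name):

```r
library(tidyr)

wide_toy <- data.frame(COMPANY = c("COMPANY A", "COMPANY B"),
                       PRODUCT = c("PRODUCT 1", "PRODUCT 3"),
                       REVENUESJAN2010 = c(6400, 5900),
                       REVENUESFEB2010 = c(11050, 4150),
                       COSTSJAN2010 = c(8500, 4200),
                       COSTSFEB2010 = c(10400, 6100))

# First group becomes the output column name (".value"), second group the DATE
long_toy <- pivot_longer(wide_toy,
                         cols = -c(COMPANY, PRODUCT),
                         names_to = c(".value", "DATE"),
                         names_pattern = "^(REVENUES|COSTS)(.*)$")
long_toy
```

The DATE column then still holds labels such as JAN2010 and can be reformatted as in the mutate step above.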

Data

df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B", 
"COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1", 
"PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050, 
3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800, 
5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750, 
650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850, 
18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950), 
    COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850, 
    3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
    cols = list(COMPANY = structure(list(), class = c("collector_character", 
    "collector")), PRODUCT = structure(list(), class = c("collector_character", 
    "collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double", 
    "collector")), COSTSJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSDEC2016 = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))