I have a very large dataset that I need to reshape from wide to long.
My dataset looks like:
COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900
I would like it to look like:
COMPANY PRODUCT DATE REVENUES COSTS
COMPANY A PRODUCT 1 Dec-16 10600 9850
COMPANY A PRODUCT 1 Feb-10 11050 10400
COMPANY A PRODUCT 1 Jan-10 6400 8500
COMPANY A PRODUCT 1 Mar-10 6550 9100
COMPANY A PRODUCT 2 Dec-16 3800 3250
COMPANY A PRODUCT 2 Feb-10 3000 2400
COMPANY A PRODUCT 2 Jan-10 2700 2850
COMPANY A PRODUCT 2 Mar-10 2800 3100
COMPANY B PRODUCT 3 Dec-16 3750 4600
COMPANY B PRODUCT 3 Feb-10 4150 6100
COMPANY B PRODUCT 3 Jan-10 5900 4200
COMPANY B PRODUCT 3 Mar-10 5750 2950
COMPANY B PRODUCT 4 Dec-16 650 500
COMPANY B PRODUCT 4 Feb-10 600 700
COMPANY B PRODUCT 4 Jan-10 550 200
COMPANY B PRODUCT 4 Mar-10 0 100
COMPANY B PRODUCT 5 Dec-16 2100 450
COMPANY B PRODUCT 5 Feb-10 3750 1700
COMPANY B PRODUCT 5 Jan-10 1500 1850
COMPANY B PRODUCT 5 Mar-10 550 3150
COMPANY C PRODUCT 6 Dec-16 21250 23900
COMPANY C PRODUCT 6 Feb-10 17250 26950
COMPANY C PRODUCT 6 Jan-10 19300 18200
COMPANY C PRODUCT 6 Mar-10 23600 18200
In Stata, I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string
How can I do this in R?
Answer 0 (score: 3)
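There are several other approaches that can be more concise than the "tidyverse" options already proposed.
All of the examples below use the sample data from JMT2080AD's answer with set.seed(1) (for reproducibility).
reshape
It is not always a straightforward function to use, but reshape is very powerful once you figure it out. In this case you don't have a sep, which makes things a bit trickier: you have to be more specific about things like the resulting variable names and the values that should appear as "times" (by default, they are just sequential numbers).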
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
# varying is a list with one vector of column names per output column;
# a flat vector is assumed to be ordered revenues, cost, revenues, cost, ...
reshape(yourData, direction = "long",
        varying = list(grep("^revenues", names(yourData), value = TRUE),
                       grep("^cost", names(yourData), value = TRUE)),
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     1164 1168  1
# 2.Jan2010 Company A Product 2 Jan2010     1430 1465  2
# 3.Jan2010 Company B Product 3 Jan2010     1932  533  3
# 4.Jan2010 Company B Product 4 Jan2010     2771 1456  4
# 5.Jan2010 Company B Product 5 Jan2010     1004 2674  5
# ....
This is pretty much what you're looking for, perhaps with a slightly different date format.
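If the Dec-16 style shown in the question is wanted, the date labels can be reformatted after reshaping. The following is only a minimal base R sketch, assuming every label is a three-letter month abbreviation followed by a four-digit year (as in the sample data); to_mon_yy is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: turn labels such as "Jan2010" into "Jan-10".
# Assumes a three-letter month followed by a four-digit year.
to_mon_yy <- function(x) {
  month <- substr(x, 1, 3)                    # first three characters
  year2 <- substr(x, nchar(x) - 1, nchar(x))  # last two digits of the year
  paste0(month, "-", year2)
}

to_mon_yy(c("Jan2010", "Dec2016"))
```

Plain string slicing is used here rather than date parsing so that the result does not depend on the session locale.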
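data.table
If you're after performance, you can look at melt from "data.table"; you should be able to do this with something like the following. As with the reshape approach, you need to store the "times" so that the dates can be reintroduced after melting the data.
(Note: I know this is very similar to @Uwe's approach.)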
library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
value.name = c("revenues", "cost"))[
, variable := factor(variable, labels = times)][]
# company product variable revenues cost
# 1: Company A Product 1 Jan2010 1164 1168
# 2: Company A Product 2 Jan2010 1430 1465
# 3: Company B Product 3 Jan2010 1932 533
# 4: Company B Product 4 Jan2010 2771 1456
# 5: Company B Product 5 Jan2010 1004 2674
# ---
# 20: Company A Product 2 Apr2010 2444 1883
# 21: Company B Product 3 Apr2010 2837 1824
# 22: Company B Product 4 Apr2010 1030 2473
# 23: Company B Product 5 Apr2010 2129 558
# 24: Company C Product 6 Apr2010 814 1693
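The relabelling step works because, when several groups of columns are melted at once, the variable column is a factor whose levels are the positions 1, 2, 3, ... within each group. A small base R sketch of just that step, using a hand-made factor as a stand-in for the melted column (an assumption for illustration, not real melt output):

```r
# Stand-in for the "variable" column produced by a multi-column melt
variable <- factor(c(1, 1, 2, 2, 3, 3))
times <- c("Jan2010", "Feb2010", "Mar2010")

# labels are matched to the existing levels in order: 1 -> Jan2010, 2 -> Feb2010, ...
relabelled <- factor(variable, labels = times)
as.character(relabelled)
```

This only gives correct dates if times is in the same order as the columns in the wide data, which is why it is built from names(yourData).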
merged.stack
My "splitstackshape" package has a function called merged.stack that tries to make this particular kind of reshaping easier. With it, you can try:
library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
# company product .time_1 revenues cost
# 1: Company A Product 1 Apr2010 1450 2457
# 2: Company A Product 1 Feb2010 2862 1705
# 3: Company A Product 1 Jan2010 1164 1168
# 4: Company A Product 1 Mar2010 2218 2486
# 5: Company A Product 2 Apr2010 2444 1883
# 6: Company A Product 2 Feb2010 2152 1999
# 7: Company A Product 2 Jan2010 1430 1465
# 8: Company A Product 2 Mar2010 1460 770
# 9: Company B Product 3 Apr2010 2837 1824
# 10: Company B Product 3 Feb2010 2073 1734
# ...
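Some day I'll update the function, which was written before melt in "data.table" could handle semi-wide output formats. I've come up with a partial solution, but then I stopped fiddling with it.
In fact, with the function linked above, the solution is simply: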
ReshapeLong_(yourData, c("revenues", "cost"))
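extract
The other solutions using the tidyverse seem to go about things in a very odd way. A better solution is to use extract to get the data you want into new columns. You first have to gather the data into a very long format, and afterwards spread the data into a wider format.
Here's the approach I would use: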
library(tidyverse)
yourData %>%
gather(var, val, -company, -product) %>%
extract(var, into = c("type", "month", "year"),
regex = ("(revenues|cost)(...)(.*)")) %>%
spread(type, val)
# company product month year cost revenues
# 1 Company A Product 1 Apr 2010 2457 1450
# 2 Company A Product 1 Feb 2010 1705 2862
# 3 Company A Product 1 Jan 2010 1168 1164
# 4 Company A Product 1 Mar 2010 2486 2218
# 5 Company A Product 2 Apr 2010 1883 2444
# 6 Company A Product 2 Feb 2010 1999 2152
# ...
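To see what the regular expression passed to extract captures, the same pattern can be tried on a couple of column names in base R. This is only an illustrative sketch; the two names in vars are taken from the sample data.

```r
vars <- c("revenuesJan2010", "costApr2010")

# regexec() returns the full match plus one entry per capture group
m <- regmatches(vars, regexec("(revenues|cost)(...)(.*)", vars))

# drop the full match and keep the three groups as columns
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("type", "month", "year")
parts
```

The (...) group takes exactly three characters, so the pattern relies on three-letter month abbreviations.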
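Answer 1 (score: 1)
The tricky thing here is that you have the dates packed into the column names. Those have to be parsed out before you can shape the table the way you want. I loop over each column, parse the date and the type of observation from each sub-table's column name, bind the sub-tables together, and then cast on cost/revenue. I'm sure there are more elegant solutions out there.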
library(reshape)
## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
product = paste("Product", 1:6),
revenuesJan2010 = round(runif(6, 500, 3000)),
revenuesFeb2010 = round(runif(6, 500, 3000)),
revenuesMar2010 = round(runif(6, 500, 3000)),
revenuesApr2010 = round(runif(6, 500, 3000)),
costJan2010 = round(runif(6, 500, 3000)),
costFeb2010 = round(runif(6, 500, 3000)),
costMar2010 = round(runif(6, 500, 3000)),
costApr2010 = round(runif(6, 500, 3000)))
## a function that parses the date from the column name
columnParse <- function(tab){
colNm <- names(tab)[3]
names(tab)[3] <- "value"
colDate <- strsplit(colNm, "revenues|cost")[[1]][2]
colDate <- gsub("([A-Za-z]+)", "\\1-", colDate)
tab$date <- colDate
tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
return(tab)
}
## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
lapply(3:ncol(yourData),
function(x) columnParse(yourData[c(1:2, x)])))
## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")
Answer 2 (score: 1)
Here is another option using tidyverse and stringr. First, the sample data:
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
product = paste("Product", 1:6),
REVENUESJan2010 = round(runif(6, 500, 3000)),
REVENUESFeb2010 = round(runif(6, 500, 3000)),
REVENUESMar2010 = round(runif(6, 500, 3000)),
REVENUESApr2010 = round(runif(6, 500, 3000)),
COSTSJan2010 = round(runif(6, 500, 3000)),
COSTSFeb2010 = round(runif(6, 500, 3000)),
COSTSMar2010 = round(runif(6, 500, 3000)),
COSTSApr2010 = round(runif(6, 500, 3000)))
The solution using tidyverse and stringr:
library(tidyverse)
library(stringr)
newData <- yourData %>%
gather(key = rev.cost.date, value, -company, -product) %>%
mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
select(-rev.cost.date) %>%
spread(value = value, key = finance.type) %>%
mutate(date = paste0(str_sub(date, 1, 3), "-", str_sub(date, 4, 8)))
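Answer 3 (score: 1)
As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So the columns starting with REVENUES and COSTS can be gathered into two separate columns.
In addition, the dates (months) are packed into the column names without a separator. They are extracted from the column names using a regular expression with a look-behind and are used to replace the factor levels of the DATE column.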
library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit()
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT      DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1   JAN2010     6400  8500
 2: COMPANY A PRODUCT 2   JAN2010     2700  2850
 3: COMPANY B PRODUCT 3   JAN2010     5900  4200
 4: COMPANY B PRODUCT 4   JAN2010      550   200
 5: COMPANY B PRODUCT 5   JAN2010     1500  1850
 6: COMPANY C PRODUCT 6   JAN2010    19300 18200
 7: COMPANY A PRODUCT 1   FEB2010    11050 10400
 8: COMPANY A PRODUCT 2   FEB2010     3000  2400
 9: COMPANY B PRODUCT 3   FEB2010     4150  6100
10: COMPANY B PRODUCT 4   FEB2010      600   700
11: COMPANY B PRODUCT 5   FEB2010     3750  1700
12: COMPANY C PRODUCT 6   FEB2010    17250 26950
13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
16: COMPANY B PRODUCT 4 MARCH2010        0   100
17: COMPANY B PRODUCT 5 MARCH2010      550  3150
18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
19: COMPANY A PRODUCT 1   DEC2016    10600  9850
20: COMPANY A PRODUCT 2   DEC2016     3800  3250
21: COMPANY B PRODUCT 3   DEC2016     3750  4600
22: COMPANY B PRODUCT 4   DEC2016      650   500
23: COMPANY B PRODUCT 5   DEC2016     2100   450
24: COMPANY C PRODUCT 6   DEC2016    21250 23900
      COMPANY   PRODUCT      DATE REVENUES COSTS
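The look-behind extraction can also be tried in base R, which may help when checking what the pattern picks up. A minimal sketch, assuming a shortened vector of column names in place of names(wide):

```r
# Shortened stand-in for names(wide)
nms <- c("COMPANY", "PRODUCT", "REVENUESJAN2010", "REVENUESMARCH2010", "COSTSJAN2010")

# (?<=REVENUES) only matches text that is directly preceded by "REVENUES";
# look-behind needs perl = TRUE in base R
hits <- regexpr("(?<=REVENUES)\\w*$", nms, perl = TRUE)

# regmatches() silently drops the names that did not match
months <- regmatches(nms, hits)
months
```

Only the REVENUES columns are used, so each month appears once and in column order.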
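A naming scheme that uses alphabetic month names followed by the year does not allow the data to be sorted properly by DATE: DEC2016 comes before FEB2010, and FEB2010 comes before JAN2010. The ISO 8601 naming convention puts the year first, followed by the number of the month.
So we can use the following naming scheme instead: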
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT    DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1 2010-01     6400  8500
 2: COMPANY A PRODUCT 2 2010-01     2700  2850
 3: COMPANY B PRODUCT 3 2010-01     5900  4200
 4: COMPANY B PRODUCT 4 2010-01      550   200
 5: COMPANY B PRODUCT 5 2010-01     1500  1850
 6: COMPANY C PRODUCT 6 2010-01    19300 18200
 7: COMPANY A PRODUCT 1 2010-02    11050 10400
 8: COMPANY A PRODUCT 2 2010-02     3000  2400
 9: COMPANY B PRODUCT 3 2010-02     4150  6100
10: COMPANY B PRODUCT 4 2010-02      600   700
11: COMPANY B PRODUCT 5 2010-02     3750  1700
12: COMPANY C PRODUCT 6 2010-02    17250 26950
13: COMPANY A PRODUCT 1 2010-03     6550  9100
14: COMPANY A PRODUCT 2 2010-03     2800  3100
15: COMPANY B PRODUCT 3 2010-03     5750  2950
16: COMPANY B PRODUCT 4 2010-03        0   100
17: COMPANY B PRODUCT 5 2010-03      550  3150
18: COMPANY C PRODUCT 6 2010-03    23600 18200
19: COMPANY A PRODUCT 1 2016-12    10600  9850
20: COMPANY A PRODUCT 2 2016-12     3800  3250
21: COMPANY B PRODUCT 3 2016-12     3750  4600
22: COMPANY B PRODUCT 4 2016-12      650   500
23: COMPANY B PRODUCT 5 2016-12     2100   450
24: COMPANY C PRODUCT 6 2016-12    21250 23900
      COMPANY   PRODUCT    DATE REVENUES COSTS
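The sorting problem and the year-month fix can be reproduced without any packages. The sketch below is an illustration only: it assumes upper-case labels whose first three letters are an English month abbreviation, and it uses the built-in month.abb constant instead of lubridate.

```r
labels <- c("JAN2010", "FEB2010", "MARCH2010", "DEC2016")

# Alphabetical sorting puts DEC2016 first, which is not chronological
sorted_labels <- sort(labels, method = "radix")

# Month number from the first three letters, year from the trailing digits
month_num <- match(substr(labels, 1, 3), toupper(month.abb))
year <- sub("^[A-Z]+", "", labels)

# Year first, then zero-padded month: sorts chronologically as plain text
iso <- sprintf("%s-%02d", year, month_num)
sorted_iso <- sort(iso, method = "radix")
```

Here sorted_labels starts with DEC2016, while sorted_iso runs from 2010-01 to 2016-12.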
library(data.table)
wide <- data.table(
readr::read_table(
" COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900"
))
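Answer 4 (score: 1)
I think the most explicit way (i.e., with no need to rename variables) to reshape from wide to long in R is to use the base R reshape() function and specify the different columns to be "stacked" as a list. See this blog post.
I'll use the data from JMT2080AD's answer and set the seed to 789.
set.seed(789)
### Create a list of the variables you want to reshape/stack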
reshape.vars <- list(c("revenuesJan2010", "revenuesFeb2010", "revenuesMar2010", "revenuesApr2010"), # revenues
c("costJan2010", "costFeb2010", "costMar2010", "costApr2010")) # cost
### reshape wide to long
reshape(yourData, #dataframe
direction="long", #wide to long
varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
timevar="date", #the repeated measures times
v.names=c("revenues", "cost")) #the repeated measures names
# company product date revenues cost id
# 1.1 Company A Product 1 1 2250 1574 1
# 2.1 Company A Product 2 1 734 1793 2
# 3.1 Company B Product 3 1 530 1282 3
# 4.1 Company B Product 4 1 1979 1741 4
# 5.1 Company B Product 5 1 1730 2558 5
# 6.1 Company C Product 6 1 550 1757 6
# 1.2 Company A Product 1 2 1932 1048 1
#...
# 5.3 Company B Product 5 3 890 1103 5
# 6.3 Company C Product 6 3 2113 2469 6
# 1.4 Company A Product 1 4 2426 2382 1
# 2.4 Company A Product 2 4 778 2995 2
# 3.4 Company B Product 3 4 1359 989 3
# 4.4 Company B Product 4 4 1618 912 4
# 5.4 Company B Product 5 4 895 2109 5
# 6.4 Company C Product 6 4 1258 2803 6
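Using the list approach, the variables that should be stacked are identified without error. I find that even with 100+ variables to reshape, copying and pasting to create the list of variables doesn't take that long if renaming would be cumbersome.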
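If copy/paste becomes tedious, the same list can probably be built from the column names themselves. The following is a sketch under assumptions, not part of the answer above: wide is a small made-up data frame, stubs holds the column prefixes, and times is passed so that the date column shows the month labels instead of 1, 2, ...

```r
# Made-up wide data with the same naming pattern as the sample data
wide <- data.frame(
  company = c("Company A", "Company B"),
  product = c("Product 1", "Product 2"),
  revenuesJan2010 = c(10, 20),
  revenuesFeb2010 = c(11, 21),
  costJan2010 = c(5, 6),
  costFeb2010 = c(7, 8)
)

stubs <- c("revenues", "cost")

# One vector of column names per stub, in the order the columns appear
reshape.vars <- lapply(stubs, function(s) grep(paste0("^", s), names(wide), value = TRUE))

# Date labels taken from the first group of columns
times <- sub("^revenues", "", reshape.vars[[1]])

long <- reshape(wide, direction = "long",
                varying = reshape.vars, v.names = stubs,
                timevar = "date", times = times,
                idvar = c("company", "product"))
long
```

This relies on the revenues and cost columns appearing in the same date order; if they do not, the list has to be ordered by hand.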
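Answer 5 (score: 0)
As a keen convert from Stata's reshape, I find tidyr::gather and tidyr::spread very intuitive. gather basically does reshape long, and spread does reshape wide.
Here is the code to change your data into the form you want: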
library(tidyr)
# your_data_frame stands for your own wide data frame
new_data <-
  gather(data = your_data_frame,
         key = var_holder,
         value = val_holder,
         -COMPANY,
         -PRODUCT)
# insert an underscore between the variable stub and the date
new_data$var_holder <- sub("REVENUES", "REVENUES_", new_data$var_holder)
new_data$var_holder <- sub("COSTS", "COSTS_", new_data$var_holder)
new_data <-
  separate(data = new_data,
           col = var_holder,
           into = c("var", "date")) %>%
  spread(key = var,
         value = val_holder)
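Done!
gather works by taking all the specified variable names (or, in this case, the unspecified ones; note the two variables preceded by a "-" sign) and placing them under a new variable whose name is given by "key = ..." (creating new rows). It then places the values that fell under those variables under a separate variable whose name is given by "value = ...".
spread works in the opposite direction. Hope this helps!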
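A tiny round trip may make the two directions easier to see. This sketch assumes the tidyr package is installed and uses a made-up two-row data frame; the column names are illustrative only.

```r
library(tidyr)

wide <- data.frame(
  COMPANY = c("A", "B"),
  REVENUES_JAN2010 = c(1, 2),
  COSTS_JAN2010 = c(3, 4)
)

# gather: every column except COMPANY becomes a (key, value) pair, one row each
long <- gather(wide, key = "var_holder", value = "val_holder", -COMPANY)

# spread: the keys go back to being column names
back <- spread(long, key = var_holder, value = val_holder)
```

long has four rows (two companies times two gathered columns), and back has the same columns as wide, although not necessarily in the same order.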
Answer 6 (score: 0)
An option using the development version of tidyr (version "0.8.3.9000"):
library(dplyr)
library(tidyr)
library(stringr)
library(zoo)
library(readr)
df1 %>%
rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>%
mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
# A tibble: 24 x 5
# COMPANY PRODUCT DATE REVENUES COSTS
# <chr> <chr> <chr> <dbl> <dbl>
# 1 COMPANY A PRODUCT 1 Jan-2010 6400 8500
# 2 COMPANY A PRODUCT 1 Feb-2010 11050 10400
# 3 COMPANY A PRODUCT 1 Mar-2010 6550 9100
# 4 COMPANY A PRODUCT 1 Dec-2016 10600 9850
# 5 COMPANY A PRODUCT 2 Jan-2010 2700 2850
# 6 COMPANY A PRODUCT 2 Feb-2010 3000 2400
# 7 COMPANY A PRODUCT 2 Mar-2010 2800 3100
# 8 COMPANY A PRODUCT 2 Dec-2016 3800 3250
# 9 COMPANY B PRODUCT 3 Jan-2010 5900 4200
#10 COMPANY B PRODUCT 3 Feb-2010 4150 6100
# … with 14 more rows
df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B",
"COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1",
"PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050,
3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800,
5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750,
650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850,
18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950),
COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850,
3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(COMPANY = structure(list(), class = c("collector_character",
"collector")), PRODUCT = structure(list(), class = c("collector_character",
"collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double",
"collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double",
"collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double",
"collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double",
"collector")), COSTSJAN2010 = structure(list(), class = c("collector_double",
"collector")), COSTSFEB2010 = structure(list(), class = c("collector_double",
"collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double",
"collector")), COSTSDEC2016 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))