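Update: I should have been clearer. I am trying to check the enhanced functionality when reshaping with data.table: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html. I have updated the title.
My dataset contains two groups of variables, Credit_Risk_Capital and Name_concentration. Each is calculated under two methodologies, old and new. When I melt them using the data.table package, the variable names default to 1 and 2. How can I change them to Credit_Risk_Capital and Name_Concentration?
Here is the dataset: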
library(data.table)
df <- data.table(id = c(1:100),
                 Credit_risk_Capital_old = rnorm(100, mean = 400, sd = 60),
                 NameConcentration_old = rnorm(100, mean = 100, sd = 10),
                 Credit_risk_Capital_New = rnorm(100, mean = 200, sd = 10),
                 NameConcentration_New = rnorm(100, mean = 40, sd = 10))
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
t1 <- melt(df, measure.vars = list(old, new), variable.name = "CapitalChargeType",
           value.name = c("old", "new"))
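Now, instead of having the elements of the CapitalChargeType column labelled 1 and 2, I would like them to be Credit_risk_Capital and NameConcentration. I can obviously change them in a subsequent step using the "match" function, but is there any way to do it within melt itself?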
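Answer 0 (score: 2)
The issue here is that melt() doesn't know how to name the variables when there is more than one measure variable. So, it resorts to simply numbering the variables.
David has pointed out that there is a feature request. However, I will show two workarounds and compare them in terms of speed (plus the tidyr answer).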
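The first workaround is to melt() all measure variables (keeping the variable names), create new variable names, and dcast() the temporary result again to end up with two value columns. austensen is using this recast approach as well.
library(data.table) # CRAN version 1.10.4 used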
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)]
dcast(long, id + CapitalChargeType ~ age)
      id   CapitalChargeType       New       old
  1:   1 Credit_risk_Capital 204.85227 327.57606
  2:   1   NameConcentration  34.20043 104.14524
  3:   2 Credit_risk_Capital 206.96769 416.64575
  4:   2   NameConcentration  30.46721  95.25282
  5:   3 Credit_risk_Capital 201.85514 465.06647
 ---
196:  98   NameConcentration  45.38833  90.34097
197:  99 Credit_risk_Capital 203.53625 458.37501
198:  99   NameConcentration  40.14643 101.62655
199: 100 Credit_risk_Capital 203.19156 527.26703
200: 100   NameConcentration  30.83511  79.21762
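Note that the variable names are split at the last _ before the final old or New, respectively. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"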
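To see the look-ahead split in isolation, here is a minimal sketch using base R's strsplit() on two of the column names. The vector name vars is only an illustration; tstrsplit() from data.table does the same split and then transposes the result into one vector per part.

```r
# Two of the column names from the question
vars <- c("Credit_risk_Capital_old", "NameConcentration_New")

# Split only at the underscore that is directly followed by "New" or "old"
# at the end of the string; the other underscores are left alone.
# perl = TRUE is required because look-ahead is a PCRE feature.
parts <- strsplit(vars, "_(?=(New|old)$)", perl = TRUE)

print(parts)
```

Each element of parts holds the variable name and the old/New suffix as separate strings.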
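For the second workaround, we pick up David's suggestion to use the patterns() function, which is equivalent to specifying a list of measure variables.
As a side note: the order of the list (or of the patterns) determines the order of the value columns: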
melt(df, measure.vars = patterns("New$", "old$"))
    id variable    value1    value2
1:   1        1 204.85227 327.57606
2:   2        1 206.96769 416.64575
3:   3        1 201.85514 465.06647
...
melt(df, measure.vars = patterns("old$", "New$"))
    id variable    value1    value2
1:   1        1 327.57606 204.85227
2:   2        1 416.64575 206.96769
3:   3        1 465.06647 201.85514
...
As already pointed out by the OP, melting with multiple measure variables
long <- melt(df, measure.vars = patterns("old$", "New$"),
variable.name = "CapitalChargeType",
value.name = c("old", "New"))
returns numbers instead of the variable names:
str(long)
Classes ‘data.table’ and 'data.frame':	200 obs. of  4 variables:
 $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ old              : num  328 417 465 259 426 ...
 $ New              : num  205 207 202 207 203 ...
 - attr(*, ".internal.selfref")=<externalptr>
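Fortunately, CapitalChargeType is a factor, so it can be changed easily by replacing the factor levels with the help of the forcats package: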
long[, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 327.57606 204.85227
  2:   1   NameConcentration 104.14524  34.20043
  3:   2 Credit_risk_Capital 416.64575 206.96769
  4:   2   NameConcentration  95.25282  30.46721
  5:   3 Credit_risk_Capital 465.06647 201.85514
 ---
196:  98   NameConcentration  90.34097  45.38833
197:  99 Credit_risk_Capital 458.37501 203.53625
198:  99   NameConcentration 101.62655  40.14643
199: 100 Credit_risk_Capital 527.26703 203.19156
200: 100   NameConcentration  79.21762  30.83511
Note that melt() numbers the variables in the order the columns appear in df.
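Because the numbering follows the column order of df, the labels do not have to be typed by hand. The following is only a sketch under that assumption: it derives the labels from the column names (the helper name labels and the tiny 3-row data set are made up for illustration) and uses base R's factor() instead of forcats.

```r
library(data.table)

set.seed(42L)
# Small stand-in for the data set from the question (3 rows only)
df <- data.table(
  id = 1:3,
  Credit_risk_Capital_old = rnorm(3, mean = 400, sd = 60),
  NameConcentration_old   = rnorm(3, mean = 100, sd = 10),
  Credit_risk_Capital_New = rnorm(3, mean = 200, sd = 10),
  NameConcentration_New   = rnorm(3, mean = 40, sd = 10)
)

long <- melt(df, measure.vars = patterns("old$", "New$"),
             variable.name = "CapitalChargeType",
             value.name = c("old", "New"))

# Strip the _old/_New suffix; unique() keeps the order of first appearance,
# which matches the order melt() used for numbering
labels <- unique(sub("_(New|old)$", "", setdiff(names(df), "id")))

# Relabel the factor levels "1", "2" in place
long[, CapitalChargeType := factor(CapitalChargeType, labels = labels)]
print(long[order(id)])
```

This only stays correct as long as the old and New columns appear in the same relative order in df; if they do not, the labels have to be assigned explicitly.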
reshape()
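The stats package of base R has a reshape() function. Unfortunately, it doesn't accept regular expressions with positive look-ahead, so its automatic guessing of variable names can't be used. Instead, all relevant parameters have to be specified explicitly: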
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("old", "New"))
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 367.95567 194.93598
  2:   2 Credit_risk_Capital 467.98061 215.39663
  3:   3 Credit_risk_Capital 363.75586 201.72794
  4:   4 Credit_risk_Capital 433.45070 191.64176
  5:   5 Credit_risk_Capital 408.55776 193.44071
 ---
196:  96   NameConcentration  93.67931  47.85263
197:  97   NameConcentration 101.32361  46.94047
198:  98   NameConcentration 104.80926  33.67270
199:  99   NameConcentration 101.33178  32.28041
200: 100   NameConcentration  85.37136  63.57817
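The benchmark includes all four approaches discussed so far: tidyr (modified to use a regular expression with positive look-ahead), recast, melt(), and reshape(). The benchmark data consist of 100 K rows: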
n_rows <- 100000L
set.seed(1234L)
df <- data.table(
id = c(1:n_rows),
Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))
For benchmarking, the microbenchmark package is used:
library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
tidyr = {
r_tidyr <- df %>%
dplyr::as_data_frame() %>%
tidyr::gather("key", "value", -id) %>%
tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>%
tidyr::spread(age, value)
},
recast = {
r_recast <- dcast(
melt(df, id.vars = "id")[
, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)],
id + CapitalChargeType ~ age)
},
m2col = {
r_m2col <- melt(df, measure.vars = patterns("New$", "old$"),
variable.name = "CapitalChargeType",
value.name = c("New", "old"))[
, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))][order(id)]
},
reshape = {
r_reshape <- reshape(df, varying = list(new, old), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("New", "old")
)
},
times = 10L
)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
   tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188    10
  recast 215.35813 223.60715 287.28034 261.23333 338.36813  477.3355    10
   m2col  10.28721  11.35237  38.72393  14.46307  23.64113  154.3357    10
 reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892    10
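The timings show that melt() with two value columns simultaneously is about 15 times faster than the second fastest approach, reshape() (comparing median times). Both recast variants fall behind because they each require two reshaping operations. The tidyr solution is particularly slow.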
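The speed-up factor can be checked from the median column of the benchmark output. This is just the arithmetic on the figures reported above; the vector name medians is made up for the sketch.

```r
# Median timings in milliseconds, copied from the benchmark output
medians <- c(tidyr = 813.08830, recast = 261.23333,
             m2col = 14.46307, reshape = 224.13671)

# How many times slower each approach is than the two-column melt()
ratio <- round(medians / medians[["m2col"]], 1)
print(ratio)
```

The ratio for reshape comes out between 15 and 16, while recast and tidyr are further behind.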
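Answer 1 (score: 1)
I'm not sure about using melt, but here is a way using tidyr.
Note that I changed the variable names to use a . instead of _ to separate the old/new part of the name. This makes it easier to split the names into two variables, since they already contain several underscores.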
library(tidyr)
df <- dplyr::data_frame(
id = c(1:100),
Credit_risk_Capital.old= rnorm(100, mean = 400, sd = 60),
NameConcentration.old= rnorm(100, mean = 100, sd = 10),
Credit_risk_Capital.new =rnorm(100, mean = 200, sd = 10),
NameConcentration.new = rnorm(100, mean = 40, sd = 10)
)
df %>%
gather("key", "value", -id) %>%
separate(key, c("CapitalChargeType", "new_old"), sep = "\\.") %>%
spread(new_old, value)
#> # A tibble: 200 x 4
#> id CapitalChargeType new old
#> * <int> <chr> <dbl> <dbl>
#> 1 1 Credit_risk_Capital 182.10955 405.78530
#> 2 1 NameConcentration 42.21037 99.44172
#> 3 2 Credit_risk_Capital 184.28810 370.14308
#> 4 2 NameConcentration 60.92340 120.13933
#> 5 3 Credit_risk_Capital 191.07982 389.50818
#> 6 3 NameConcentration 25.81776 90.91502
#> 7 4 Credit_risk_Capital 193.64247 327.56853
#> 8 4 NameConcentration 32.71050 94.95743
#> 9 5 Credit_risk_Capital 208.63547 286.59351
#> 10 5 NameConcentration 40.76064 116.52747
#> # ... with 190 more rows
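The point about the . separator can be seen without tidyr. As a small sketch in base R (the vector name keys is hypothetical), a fixed-string split is enough once the suffix is set off by a character that occurs nowhere else in the name:

```r
# Column names using "." to set off the old/new suffix
keys <- c("Credit_risk_Capital.old", "NameConcentration.new")

# fixed = TRUE treats "." literally, so no regex escaping or look-ahead is needed
parts <- strsplit(keys, ".", fixed = TRUE)

# One row per key: variable name, then old/new
print(do.call(rbind, parts))
```

With _ as the separator the same split would break Credit_risk_Capital into three pieces, which is why the regular-expression variant needs more care.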
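Answer 2 (score: 0)
Although this question is old, an updated answer may help people who are directed here by a search. In the most recent development version of data.table, melt has a new measure function, with which you can do: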
df <-data.table(
id = c(1:100),
Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
NameConcentration_old= rnorm(100, mean = 100, sd = 10),
Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
NameConcentration_New = rnorm(100, mean = 40, sd = 10)
)
melt(df,
id.vars = "id",
measure(CapitalChargeType, value.name,
pattern = "(.*)_(New|old)"))
This gives the output:
id CapitalChargeType old New
<int> <char> <num> <num>
1: 1 Credit_risk_Capital 409.89004 210.30058
2: 2 Credit_risk_Capital 403.15172 197.26172
3: 3 Credit_risk_Capital 374.90492 192.21152
4: 4 Credit_risk_Capital 509.17491 195.39095
5: 5 Credit_risk_Capital 429.48302 197.44441
---
196: 96 NameConcentration 80.64747 37.61926
197: 97 NameConcentration 104.39483 13.86576
198: 98 NameConcentration 106.87475 23.15775
199: 99 NameConcentration 112.92373 44.51562
200: 100 NameConcentration 111.80915 38.40075
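To make the pattern in the measure() call easier to read, here is a sketch in base R that shows what its two capture groups pick out of a column name. It assumes that measure() assigns the groups to its arguments in order (first group to CapitalChargeType, second to value.name); the vector name cols is only for illustration.

```r
cols <- c("Credit_risk_Capital_old", "NameConcentration_New")

# regexec() returns the full match plus one entry per capture group;
# the greedy (.*) runs up to the last underscore before New/old
m <- regmatches(cols, regexec("(.*)_(New|old)", cols))

# Columns: full match, group 1 (variable name), group 2 (old/New)
print(do.call(rbind, m))
```

The second group is what ends up as the names of the value columns, which is why the output above has columns old and New.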
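The new version should show up on CRAN in a while, but until then you can use the development version. I will try to update this answer when the version moves to CRAN.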