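I have a data table in which many variables have been split into positive and negative components. I want to combine these columns so that each variable has a single signed value. (These variables always have "positive" or "negative" in their names, and no other variables do. However, the "positive" and "negative" substrings may appear anywhere in the variable name, i.e. only grepl("(positive)|(negative)", names(dt)) identifies them correctly.)
For example,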
library(data.table)
set.seed(1)
(DT <- data.table(x = 1:5,
a_positive = sample(1:5),
a_negative = sample(1:5),
b_positive = sample(1:5),
b_negative = sample(1:5),
c_normal = sample(1:5)))
   x a_positive a_negative b_positive b_negative c_normal
1: 1          2          5          2          3        5
2: 2          5          4          1          5        1
3: 3          4          2          3          4        2
4: 4          3          3          4          1        4
5: 5          1          1          5          2        3
Expected result:
   x c_normal  a  b
1: 1        5 -3 -1
2: 2        1  1 -4
3: 3        2  2 -1
4: 4        4  0  3
5: 5        3  0  3
My approach relies on a for loop and dplyr:
library(dplyr)
library(lazyeval)
library(magrittr)
unite_positive_negative <- function(dt){
  signed_names <-
    names(dt)[
      duplicated(gsub("(positive)|(negative)", "", names(dt))) |
        duplicated(gsub("(positive)|(negative)", "", names(dt)), fromLast = TRUE)]
  unsigned_names <-
    gsub("_*((positive)|(negative))_*", "", signed_names)
  the_names <-
    data.table(signed_names = signed_names,
               unsigned_names = unsigned_names)
  for (unsigned_name in unsigned_names){
    poz <- the_names[unsigned_names == unsigned_name & grepl("positive", signed_names, fixed = TRUE)][["signed_names"]]
    neg <- the_names[unsigned_names == unsigned_name & grepl("negative", signed_names, fixed = TRUE)][["signed_names"]]
    dt %<>%
      mutate_(.dots = setNames(list(interp(~p - n, p = as.name(poz), n = as.name(neg))), unsigned_name))
  }
  # Unimportant
  unselect_ <- function(.data, .dots){
    all_names <- names(.data)
    keeps <- names(.data)[!names(.data) %in% .dots]
    dplyr::select_(.data, .dots = keeps)
  }
  dt %>%
    unselect_(.dots = signed_names)
}
Is there a pure data.table way? (Or a more direct way?)
Answer 0 (score: 1)
We can try melt/dcast. Reshape the 'wide' dataset to 'long' format with melt, specifying id.var as the 'x' and 'c_normal' columns (if there are many 'normal' columns, we can also use grep to pick them out). Use tstrsplit to split the 'variable' column into two columns. Grouped by 'x', 'c_normal' and 'var1' (from the split), we take the 'value' entries for "negative" and "positive", multiply them by -1/1 and add them together. Then dcast from 'long' back to 'wide' format.
library(data.table)
dcast(melt(DT, id.var = c("x", "c_normal"))[,
        c("var1", "var2") := tstrsplit(variable, "_")
      ][, -1*value[var2=="negative"] + value[var2=="positive"],
        by = .(x, c_normal, var1)],
      x + c_normal~var1, value.var="V1")
#   x c_normal  a  b
#1: 1        5 -3 -1
#2: 2        1  1 -4
#3: 3        2  2 -1
#4: 4        4  0  3
#5: 5        3  0  3
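The same melt/dcast idea can also be written step by step, with the id columns picked by grepl() instead of being typed out. This is only a minimal sketch: it assumes every signed column is named <name>_positive or <name>_negative (the same assumption that tstrsplit(variable, "_") makes), and the names id_cols, long, wide_formula and res are illustrative, not part of the original code. The example values are typed out so the result does not depend on the random number generator.

```r
library(data.table)

# Same values as the example DT in the question, written out explicitly
DT <- data.table(x = 1:5,
                 a_positive = c(2L, 5L, 4L, 3L, 1L),
                 a_negative = c(5L, 4L, 2L, 3L, 1L),
                 b_positive = c(2L, 1L, 3L, 4L, 5L),
                 b_negative = c(3L, 5L, 4L, 1L, 2L),
                 c_normal   = c(5L, 1L, 2L, 4L, 3L))

# id columns: every column that is not a signed component
id_cols <- names(DT)[!grepl("(positive)|(negative)", names(DT))]

# wide -> long: one row per (id columns, signed column)
long <- melt(DT, id.vars = id_cols, variable.factor = FALSE)

# split e.g. "a_positive" into var1 = "a" and var2 = "positive"
# (assumes exactly one underscore per signed column name)
long[, c("var1", "var2") := tstrsplit(variable, "_", fixed = TRUE)]

# signed value per id combination and variable: positive minus negative
long <- long[, .(value = sum(value[var2 == "positive"]) -
                         sum(value[var2 == "negative"])),
             by = c(id_cols, "var1")]

# long -> wide: one column per variable, formula built from id_cols
wide_formula <- as.formula(paste(paste(id_cols, collapse = " + "), "~ var1"))
res <- dcast(long, wide_formula, value.var = "value")
res
```

Naming the aggregated column explicitly (value) avoids relying on the auto-generated V1, and building the formula from id_cols means additional 'normal' columns need no code changes.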
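Another option, without melt/dcast, is to subset the dataset into the "positive" and "negative" columns (assuming they are in the same order), multiply them by 1/-1, add them (+), and assign the output to a subset of the dataset that leaves out the "positive/negative" columns.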
DT1 <- DT[, c("x", grep("normal", names(DT), value=TRUE)), with = FALSE]
DT2 <- DT[, grep("positive", names(DT)), with = FALSE] +
-1 * DT[, grep("negative", names(DT)), with = FALSE]
DT1[, c("a", "b") := DT2]
DT1
#    x c_normal  a  b
# 1: 1        5 -3 -1
# 2: 2        1  1 -4
# 3: 3        2  2 -1
# 4: 4        4  0  3
# 5: 5        3  0  3
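The subsetting approach above hardcodes the new names c("a", "b") and relies on the positive and negative columns appearing in the same order. A possible variation, sketched here under the assumption that each "positive" column has a partner whose name differs only in "positive" versus "negative", pairs the columns by name instead. The variables pos_cols, neg_cols, new_cols and res are illustrative names; the gsub pattern for stripping the sign is borrowed from the question.

```r
library(data.table)

# Same values as the example DT in the question, written out explicitly
DT <- data.table(x = 1:5,
                 a_positive = c(2L, 5L, 4L, 3L, 1L),
                 a_negative = c(5L, 4L, 2L, 3L, 1L),
                 b_positive = c(2L, 1L, 3L, 4L, 5L),
                 b_negative = c(3L, 5L, 4L, 1L, 2L),
                 c_normal   = c(5L, 1L, 2L, 4L, 3L))

# positive columns, their negative partners, and the unsigned target names
pos_cols <- grep("positive", names(DT), fixed = TRUE, value = TRUE)
neg_cols <- sub("positive", "negative", pos_cols, fixed = TRUE)
new_cols <- gsub("_*positive_*", "", pos_cols)

# fail early if a positive column has no negative partner (assumption check)
stopifnot(all(neg_cols %in% names(DT)))

# work on a copy so the original DT is left untouched
res <- copy(DT)

# add each signed column by reference: positive minus negative
for (i in seq_along(pos_cols)) {
  set(res, j = new_cols[i],
      value = res[[pos_cols[i]]] - res[[neg_cols[i]]])
}

# drop the component columns by reference
res[, c(pos_cols, neg_cols) := NULL]
res
```

Because the pairing is done by name, this does not depend on column order, and the sign substring may sit anywhere in the column name as long as the rest of the name matches.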