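When using dplyr's mutate_at, I am having trouble passing the column name into a custom function. I have a dataset "dt" with thousands of columns, and I want to mutate some of those columns in a way that depends on the column name.
I have this code:
Option 1: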
relevantcols = c("A", "B", "C")
myfunc <- function(colname, x) {
#write different logic per column name
}
dt%>%
mutate_at(relevantcols, funs(myfunc(<what should i give?>,.)))
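I tried to solve the problem another way, by iterating over the relevant columns and applying mutate_at to each element of the vector, as follows:
Option 2: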
for (i in seq_along(relevantcols)) {
  # assign the result back, otherwise each iteration's mutate is discarded
  dt <- dt %>%
    mutate_at(relevantcols[i], funs(myfunc(relevantcols[i], .)))
}
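I do get the column name in option 2, but it is 10 times slower than option 1. Can I somehow get the column name in option 1?
Adding an example for more clarity: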
df = data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
df
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1 1 5.234383 6.857227 4.480943 7.233947
2 2 7.441399 3.777524 2.134075 6.310293
3 3 7.686558 8.598688 9.814882 9.192952
4 4 6.033345 5.658716 5.167388 3.018563
5 5 5.595006 7.582548 9.302917 6.071108
relevantcols = c("Mon_channelA", "Mon_channelB")
myfunc <- function(colname, x) {
#based on the channel and weekday, compare the data from corresponding column with the same channel but different weekday and return T if higher else F
}
# required output
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1 1 T F 4.480943 7.233947
2 2 T F 2.134075 6.310293
3 3 F F 9.814882 9.192952
4 4 T T 5.167388 3.018563
5 5 F T 9.302917 6.071108
Answer 0 (score: 0)
You can do something like this:
L <- c("A","B")
df <- data.frame(A=rep(1:3,2),B=1:6,C=7:12)
df
# A B C
#1 1 1 7
#2 2 2 8
#3 3 3 9
#4 1 4 10
#5 2 5 11
#6 3 6 12
f <- function(x,y) x^y
df %>% mutate_at(L,funs(f(.,2)))
# A B C
#1 1 1 7
#2 4 4 8
#3 9 9 9
#4 1 16 10
#5 4 25 11
#6 9 36 12
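The example above passes a fixed extra argument, not the column name. As a minimal sketch of the column-name case from the question, and assuming dplyr 1.0.0 or later is available, across() together with cur_column() exposes the name of the column currently being processed. The helper myfunc, its third argument, and the seed below are illustrative assumptions, not part of the original posts; the data frame is passed in explicitly so the helper can look up the matching Tue_ column, which only suits ungrouped data.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10),
                 Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10),
                 Tue_channelB = runif(5, 1, 10))

relevantcols <- c("Mon_channelA", "Mon_channelB")

# Hypothetical helper: derive the Tue_ counterpart from the column name
# and report whether the Mon_ value is higher.
myfunc <- function(colname, x, data) {
  other <- sub("^Mon_", "Tue_", colname)
  x > data[[other]]
}

# cur_column() returns the current column's name as a string inside across()
result <- df %>%
  mutate(across(all_of(relevantcols), ~ myfunc(cur_column(), .x, df)))

result
```

This keeps everything in a single mutate call, so it avoids the per-column loop from option 2.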
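Answer 1 (score: 0)
I left a comment about the data types, but assuming the data types are what you are looking for, this is the approach I take for this kind of problem. I do it in a seemingly convoluted series of reshapes, but it lets you set up the variables to compare without a lot of hard-coding. I'll break it into pieces.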
library(tidyverse)
set.seed(928)
df <- data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
First, I reshape it into long format and split keys such as "Mon_channelA" into a day and a channel. That way, you can use the channel name to match up the values to compare.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  head()
#> employee day channel value
#> 1 1 Mon channelA 2.039619
#> 2 2 Mon channelA 8.153684
#> 3 3 Mon channelA 9.027932
#> 4 4 Mon channelA 1.161967
#> 5 5 Mon channelA 3.583353
#> 6 1 Mon channelB 7.102797
Then spread it back to wide format by day. Now you have one column per day for every combination of employee and channel.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  head()
#> employee channel Mon Tue
#> 1 1 channelA 2.039619 9.826677
#> 2 1 channelB 7.102797 7.388568
#> 3 2 channelA 8.153684 5.848375
#> 4 2 channelB 6.299178 9.452274
#> 5 3 channelA 9.027932 5.458906
#> 6 3 channelB 7.029408 7.087011
Then do the comparison and gather the data again. Note that because the "value" column holds numeric values, everything becomes numeric, and the logical values are converted to 1 or 0.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  head()
#> employee channel day value
#> 1 1 channelA Mon 0
#> 2 1 channelB Mon 0
#> 3 2 channelA Mon 1
#> 4 2 channelB Mon 0
#> 5 3 channelA Mon 1
#> 6 3 channelB Mon 0
The last few steps are to unite the day and channel back together so the labels are as you had them, spread to wide format, and then convert all columns starting with "Mon" to logical.
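The code for these last steps is not shown in the thread. As a hedged reconstruction of what the prose describes (unite, spread, then convert the "Mon" columns), and not the answer author's original code, the full pipeline might look roughly like this. The intermediate column name key and the object name result are assumptions made for the sketch.

```r
library(dplyr)
library(tidyr)

set.seed(928)
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10),
                 Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10),
                 Tue_channelB = runif(5, 1, 10))

result <- df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  # glue day and channel back into the original labels, e.g. "Mon_channelA"
  unite(key, day, channel, sep = "_") %>%
  # back to one column per original variable
  spread(key = key, value = value) %>%
  # the 1/0 values in the Mon_ columns become TRUE/FALSE again
  mutate_at(vars(starts_with("Mon")), as.logical)

result
```

The Tue_ columns stay numeric, and only the columns whose names start with "Mon" end up logical.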
Created on 2018-09-28 by the reprex package (v0.2.1)
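Answer 2 (score: 0)
This is an old question, but I stumbled upon a workaround that uses a custom mutate/case_when function in combination with purrr::reduce.
It is important to use non-standard evaluation (NSE) inside the mutate/case_when statement so that it matches the variable names the custom function needs.
I do not know of a comparable approach with mutate_at.
Below I provide two examples: one in its most basic form (using your original data), and a more advanced version (with three weekdays and two channels) that creates more than two variables. The latter needs an initial set-up, here done with switch.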
library(tidyverse)
# your data
df <- data.frame(employee=seq(1:5),
Mon_channelA=runif(5,1,10),
Mon_channelB=runif(5,1,10),
Tue_channelA=runif(5,1,10),
Tue_channelB=runif(5,1,10)
)
# custom function which takes two arguments, df and a string variable name
myfunc <- function(df, x) {
mutate(df,
# overwrites all "Mon_channel" variables ...
!! paste0("Mon_", x) := case_when(
# ... with TRUE, when Mon_channel is smaller than Tue_channel, and FALSE else
!! sym(paste0("Mon_", x)) < !! sym(paste0("Tue_", x)) ~ T,
T ~ F
)
)
}
# define the variables you want to loop over
var_ls <- c("channelA", "channelB")
# use var_ls and myfunc with reduce on your data
df %>%
reduce(var_ls, myfunc, .init = .)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 FALSE FALSE 3.437975 2.458389
#> 2 2 FALSE TRUE 3.686903 4.772390
#> 3 3 TRUE TRUE 5.158234 5.378021
#> 4 4 TRUE TRUE 5.338950 3.109760
#> 5 5 TRUE FALSE 6.365173 3.450495
Created on 2020-02-03 by the reprex package (v0.3.0)
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
# your data plus one weekday with two channels
df <- data.frame(employee=seq(1:5),
Mon_channelA=runif(5,1,10),
Mon_channelB=runif(5,1,10),
Tue_channelA=runif(5,1,10),
Tue_channelB=runif(5,1,10),
Wed_channelA=runif(5,1,10),
Wed_channelB=runif(5,1,10)
)
# custom function which takes two arguments, df and a string variable name
myfunc <- function(df, x) {
# an initial set-up is needed
# id gets the original day
id <- str_extract(x, "^\\w{3}")
# based on id the day of comparison is mapped with switch
y <- switch(id,
"Mon" = "Tue",
"Tue" = "Wed")
# j extracts the channel name including the underscore
j <- str_extract(x, "_channel[A-Z]{1}")
# this makes the function definition rather easy:
mutate(df,
!! x := case_when(
!! sym(x) < !! sym(paste0(y, j)) ~ T,
T ~ F
)
)
}
# define the variables you want to loop over
var_ls <- c("Mon_channelA",
"Mon_channelB",
"Tue_channelA",
"Tue_channelB")
# use var_ls and myfunc with reduce on your data
df %>%
reduce(var_ls, myfunc, .init = .)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 TRUE TRUE TRUE FALSE
#> 2 2 FALSE TRUE TRUE FALSE
#> 3 3 FALSE TRUE FALSE TRUE
#> 4 4 FALSE TRUE TRUE FALSE
#> 5 5 TRUE FALSE FALSE FALSE
#> Wed_channelA Wed_channelB
#> 1 9.952454 5.634686
#> 2 9.356577 4.514683
#> 3 2.721330 7.107316
#> 4 4.410240 2.740289
#> 5 5.394057 4.772162
Created on 2020-02-03 by the reprex package (v0.3.0)
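To make the role of reduce in both examples more concrete: with .init set to the data frame, reduce feeds the result of each myfunc call into the next one, so the call is equivalent to nesting myfunc by hand. The sketch below is a simplified illustration under assumptions of its own: it uses the .data pronoun in place of the !! sym() construction, a plain comparison in place of case_when, and an arbitrary seed.

```r
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10),
                 Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10),
                 Tue_channelB = runif(5, 1, 10))

# Simplified variant of the answer's function: overwrite Mon_<x> with
# TRUE where it is smaller than Tue_<x>, FALSE otherwise.
myfunc <- function(df, x) {
  mutate(df,
         !!paste0("Mon_", x) := .data[[paste0("Mon_", x)]] < .data[[paste0("Tue_", x)]])
}

var_ls <- c("channelA", "channelB")

# reduce threads the data frame through myfunc once per element of var_ls ...
via_reduce <- reduce(var_ls, myfunc, .init = df)

# ... which is the same as nesting the calls manually
by_hand <- myfunc(myfunc(df, "channelA"), "channelB")

identical(via_reduce, by_hand)
```

Because each step is a separate mutate call, this pattern scales with the number of entries in var_ls and not with a single vectorised call.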