Mutate new columns based on a different dataset

Time: 2020-10-06 14:39:13

Tags: r dplyr match mutate

For example, I have three datasets: df1_mean (the mean of each variable in df1), df1_sd (the sd of each variable in df1), and df2 (the values of df2).

df1_mean:

  A_mean B_mean C_mean D_mean E_mean
1     10     15     12     25     29

df1_sd:

  A_sd B_sd C_sd D_sd E_sd
1    3    2    5    4    2

df2:

  A  B  C  D  E
1 20 32 12 14 22
2 21 35 14 52 13
3 25 23 21 32 35
4 23 12 11 52 21
5 20 53 43 12 64
6 30 12 23 53 31

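Ideally, I would like to match each variable (i.e. A, B, C, D, E) in the *_mean and *_sd columns derived from df1 to the corresponding column in df2, and then use mutate() to create a new column from a formula, with one output column per variable.

For each variable, the final result should look something like: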

df2$A_output = (df2$A - df1$A_mean) / df1$A_sd

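Does anyone know whether there is a way to mutate() new columns using data from a different dataset? Or what the easiest way to automate this would be, rather than writing A_output = (A-10)/3, B_output = (B-15)/2, ... by hand? Thanks!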

4 Answers:

Answer 0: (score: 3)

Here are some base R options:

  • Using rep
dfout <- (df2 - df1_mean[rep(1,nrow(df2)),])/df1_sd[rep(1,nrow(df2)),]
  • Using sweep
dfout <- sweep(sweep(df2,2,unlist(df1_mean)),2,unlist(df1_sd),FUN = `/`)

Both give

> dfout
         A    B    C     D    E
1 3.333333  8.5  0.0 -2.75 -3.5
2 3.666667 10.0  0.4  6.75 -8.0
3 5.000000  4.0  1.8  1.75  3.0
4 4.333333 -1.5 -0.2  6.75 -4.0
5 3.333333 19.0  6.2 -3.25 17.5
6 6.666667 -1.5  2.2  7.00  1.0
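
To make the nested sweep call easier to follow, here is a minimal sketch that splits it into two steps and checks it against the rep version. The data is the question's, rebuilt with data.frame() for brevity; the intermediate names (centered, dfout_sweep, dfout_rep, idx) are mine and not part of the original answer.

```r
# Data from the question, rebuilt with data.frame() for brevity
df1_mean <- data.frame(A_mean = 10, B_mean = 15, C_mean = 12, D_mean = 25, E_mean = 29)
df1_sd   <- data.frame(A_sd = 3, B_sd = 2, C_sd = 5, D_sd = 4, E_sd = 2)
df2 <- data.frame(
  A = c(20, 21, 25, 23, 20, 30),
  B = c(32, 35, 23, 12, 53, 12),
  C = c(12, 14, 21, 11, 43, 23),
  D = c(14, 52, 32, 52, 12, 53),
  E = c(22, 13, 35, 21, 64, 31)
)

# Step 1: subtract each column's mean (MARGIN = 2 means "by column";
# sweep's default FUN is "-")
centered <- sweep(df2, 2, unlist(df1_mean))

# Step 2: divide each centered column by its sd
dfout_sweep <- sweep(centered, 2, unlist(df1_sd), FUN = "/")

# The rep version: repeat the single row of means/sds once per row of df2
idx <- rep(1, nrow(df2))
dfout_rep <- (df2 - df1_mean[idx, ]) / df1_sd[idx, ]

# Compare the values only, ignoring row/column name attributes
all.equal(as.matrix(dfout_sweep), as.matrix(dfout_rep), check.attributes = FALSE)
```

Both versions match columns by position, so they assume df1_mean, df1_sd and df2 list the variables in the same order.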

Data

> dput(df1_mean)
structure(list(A_mean = 10L, B_mean = 15L, C_mean = 12L, D_mean = 25L,
    E_mean = 29L), class = "data.frame", row.names = "1")

> dput(df1_sd)
structure(list(A_sd = 3L, B_sd = 2L, C_sd = 5L, D_sd = 4L, E_sd = 2L), class = "data.frame", row.names = "1")

> dput(df2)
structure(list(A = c(20L, 21L, 25L, 23L, 20L, 30L), B = c(32L,
35L, 23L, 12L, 53L, 12L), C = c(12L, 14L, 21L, 11L, 43L, 23L),
    D = c(14L, 52L, 32L, 52L, 12L, 53L), E = c(22L, 13L, 35L, 
    21L, 64L, 31L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Answer 1: (score: 3)

Try this

as.data.frame(Map(function(x, mu, sig) (x - mu) / sig, df2, df1_mean, df1_sd))

Output

         A    B    C     D    E
1 3.333333  8.5  0.0 -2.75 -3.5
2 3.666667 10.0  0.4  6.75 -8.0
3 5.000000  4.0  1.8  1.75  3.0
4 4.333333 -1.5 -0.2  6.75 -4.0
5 3.333333 19.0  6.2 -3.25 17.5
6 6.666667 -1.5  2.2  7.00  1.0
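
The question asks for new columns such as A_output alongside the originals. One possible way to get there from the Map call is sketched below; the names out and result and the "_output" suffix handling are my own additions, not part of the answer above.

```r
# Data from the question, rebuilt with data.frame() for brevity
df1_mean <- data.frame(A_mean = 10, B_mean = 15, C_mean = 12, D_mean = 25, E_mean = 29)
df1_sd   <- data.frame(A_sd = 3, B_sd = 2, C_sd = 5, D_sd = 4, E_sd = 2)
df2 <- data.frame(
  A = c(20, 21, 25, 23, 20, 30),
  B = c(32, 35, 23, 12, 53, 12),
  C = c(12, 14, 21, 11, 43, 23),
  D = c(14, 52, 32, 52, 12, 53),
  E = c(22, 13, 35, 21, 64, 31)
)

# Map walks the three data frames column by column, in parallel:
# x is a column of df2, mu and sig are the matching one-element columns
out <- Map(function(x, mu, sig) (x - mu) / sig, df2, df1_mean, df1_sd)

# Map keeps the names of its first list argument (A..E); add the suffix
names(out) <- paste0(names(df2), "_output")

# Bind the new columns next to the original ones
result <- cbind(df2, as.data.frame(out))
result
```

As with the one-liner, this pairs columns by position, so the three data frames need to list the variables in the same order.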

Answer 2: (score: 3)

Here is one way that uses vectorized math and a couple of transposes to make recycling work:

t( (t(df2) - unlist(df1_mean)) / unlist(df1_sd) )
#          A    B    C     D    E
# 1 3.333333  8.5  0.0 -2.75 -3.5
# 2 3.666667 10.0  0.4  6.75 -8.0
# 3 5.000000  4.0  1.8  1.75  3.0
# 4 4.333333 -1.5 -0.2  6.75 -4.0
# 5 3.333333 19.0  6.2 -3.25 17.5
# 6 6.666667 -1.5  2.2  7.00  1.0

It relies on the columns of the three data frames being in the same order. As long as that holds, it will be very efficient.
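
If the column order cannot be guaranteed, one option is to look the means and sds up by name first. The sketch below assumes the naming pattern from the question (<variable>_mean and <variable>_sd); the shuffled data frame and the names vars, mu, sig and out are hypothetical and only there to illustrate the lookup.

```r
# Data from the question, rebuilt with data.frame() for brevity
df1_mean <- data.frame(A_mean = 10, B_mean = 15, C_mean = 12, D_mean = 25, E_mean = 29)
df1_sd   <- data.frame(A_sd = 3, B_sd = 2, C_sd = 5, D_sd = 4, E_sd = 2)
df2 <- data.frame(
  A = c(20, 21, 25, 23, 20, 30),
  B = c(32, 35, 23, 12, 53, 12),
  C = c(12, 14, 21, 11, 43, 23),
  D = c(14, 52, 32, 52, 12, 53),
  E = c(22, 13, 35, 21, 64, 31)
)

# Hypothetical: the means arrive in a different column order
df1_mean_shuffled <- df1_mean[, c(3, 1, 5, 2, 4)]

# Reorder the means and sds by name so they line up with df2's columns
vars <- names(df2)
mu   <- unlist(df1_mean_shuffled)[paste0(vars, "_mean")]
sig  <- unlist(df1_sd)[paste0(vars, "_sd")]

# Same transpose trick as above: t(df2) has one row per variable, so the
# length-5 vectors recycle down each column of the transposed matrix
out <- t((t(df2) - mu) / sig)
out
```

Note that out is a matrix; wrap it in as.data.frame() if a data frame is needed.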

Answer 3: (score: 1)

Try this tidyverse approach:

library(tidyverse)
#Code
Output <- df2 %>% mutate(id=1:n()) %>% pivot_longer(-id) %>%
  left_join(df1_mean %>% pivot_longer(everything()) %>%
              separate(name,c('name','Var'),sep='_') %>%
              rename(Mean=value) %>% select(-Var)
  ) %>%
  left_join(
    df1_sd %>% pivot_longer(everything()) %>%
      separate(name,c('name','Var'),sep='_') %>%
      rename(SD=value) %>% select(-Var)
  ) %>% mutate(Val=(value-Mean)/SD) %>% select(-c(value,Mean,SD)) %>%
  pivot_wider(names_from = name,values_from=Val) %>% select(-id)

Output:

# A tibble: 6 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  3.33   8.5   0   -2.75  -3.5
2  3.67  10     0.4  6.75  -8  
3  5      4     1.8  1.75   3  
4  4.33  -1.5  -0.2  6.75  -4  
5  3.33  19     6.2 -3.25  17.5
6  6.67  -1.5   2.2  7      1  
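
A shorter alternative that stays closer to the mutate() wording of the question is sketched below. It assumes dplyr 1.0.0 or later (for across() and cur_column()) and the <variable>_mean / <variable>_sd naming from the question; the name result and the "_output" suffix are my choices, not the answer's.

```r
library(dplyr)

# Data from the question, rebuilt with data.frame() for brevity
df1_mean <- data.frame(A_mean = 10, B_mean = 15, C_mean = 12, D_mean = 25, E_mean = 29)
df1_sd   <- data.frame(A_sd = 3, B_sd = 2, C_sd = 5, D_sd = 4, E_sd = 2)
df2 <- data.frame(
  A = c(20, 21, 25, 23, 20, 30),
  B = c(32, 35, 23, 12, 53, 12),
  C = c(12, 14, 21, 11, 43, 23),
  D = c(14, 52, 32, 52, 12, 53),
  E = c(22, 13, 35, 21, 64, 31)
)

result <- df2 %>%
  mutate(across(
    everything(),
    # cur_column() is the name of the column being processed ("A", "B", ...),
    # used here to look up the matching mean and sd by name
    ~ (.x - df1_mean[[paste0(cur_column(), "_mean")]]) /
      df1_sd[[paste0(cur_column(), "_sd")]],
    # keep the originals and add A_output, B_output, ...
    .names = "{.col}_output"
  ))
result
```

Because the lookup is by name, this version does not depend on the column order of df1_mean and df1_sd.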

Data used:

#Data 1
df1_mean <- structure(list(A_mean = 10L, B_mean = 15L, C_mean = 12L, D_mean = 25L,E_mean = 29L), class = "data.frame", row.names = "1")

#Data 2
df1_sd <-structure(list(A_sd = 3L, B_sd = 2L, C_sd = 5L, D_sd = 4L, E_sd = 2L), class = "data.frame", row.names = "1")

#Data 3
df2 <- structure(list(A = c(20L, 21L, 25L, 23L, 20L, 30L), B = c(32L, 
35L, 23L, 12L, 53L, 12L), C = c(12L, 14L, 21L, 11L, 43L, 23L), 
    D = c(14L, 52L, 32L, 52L, 12L, 53L), E = c(22L, 13L, 35L, 
    21L, 64L, 31L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))