Question

我有一个160列和> 30k行的数据框。我想缩放每列中的值，但诀窍在于每列都属于三组之一，并且缩放应在三组中每组的所有值上进行。

这里是一个例子：

data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))

哪个会生成如下数据框：

apple.fruit    dog.pet    pear.fruit    cat.pet
          1          1         10001         11
          2          2         10002         12
          3          3         10003         13

我希望找到一种聪明的方法，找到所有带有“水果”一词的列，并在所有列中共同缩放所有水果值（并对“宠物”做同样的事情）并最终得到：

apple.fruit    dog.pet    pear.fruit    cat.pet
   -0.91305   -1.08112      0.91268     0.72075
   -0.91287   -0.90093      0.91287     0.90093
   -0.91268   -0.72075      0.91305     1.08112

说另一种方式：而不是用这种方式缩放apple.fruit：

scale(data$apple.fruit)

我希望以此方式进行缩放

scale(c(data$apple.fruit, data$pear.fruit))[1:3]

Answer 1

整理方法：将数据转换为“长”整理格式，按水果/宠物等分组，然后按组缩放

library(tidyverse)

data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))
data.tidy <- data %>%
  gather(key="id",value = "value") %>%
  mutate(type = gsub(".*\\.(.*$)","\\1",id),
         name = gsub("(.*)\\..*$","\\1",id)) %>%
  group_by(type) %>%
  mutate(scaleit = scale(value))

data.tidy
#> # A tibble: 12 x 5
#> # Groups:   type [2]
#>    id          value type  name  scaleit
#>    <chr>       <int> <chr> <chr>   <dbl>
#>  1 apple.fruit     1 fruit apple  -0.913
#>  2 apple.fruit     2 fruit apple  -0.913
#>  3 apple.fruit     3 fruit apple  -0.913
#>  4 dog.pet         1 pet   dog    -1.08 
#>  5 dog.pet         2 pet   dog    -0.901
#>  6 dog.pet         3 pet   dog    -0.721
#>  7 pear.fruit  10001 fruit pear    0.913
#>  8 pear.fruit  10002 fruit pear    0.913
#>  9 pear.fruit  10003 fruit pear    0.913
#> 10 cat.pet        11 pet   cat     0.721
#> 11 cat.pet        12 pet   cat     0.901
#> 12 cat.pet        13 pet   cat     1.08

由reprex package（v0.2.0.9000）创建于2018-08-23。

Answer 2

将数据转换为长格式，然后一次缩放一列。这是一种使用data.table::melt的方法，它使您可以方便地根据命名模式同时融化多列。

library(data.table)
setDT(data)
roots = unique(sub(".*\\.", "", names(data)))
result = melt(data, measure.vars = patterns(roots))
setnames(result, old = paste0("value", 1:length(roots)), new = roots)
for (j in names(result)[-1]) set(result, j = j, value = scale(result[[j]]))
result
#    variable      fruit        pet
# 1:        1 -0.9130535 -1.0811250
# 2:        1 -0.9128709 -0.9009375
# 3:        1 -0.9126883 -0.7207500
# 4:        2  0.9126883  0.7207500
# 5:        2  0.9128709  0.9009375
# 6:        2  0.9130535  1.0811250

否则，我认为for循环非常简单：

data = as.data.frame(data) # in case you converted to data.table  above
roots = unique(sub(".*\\.", "", names(data)))

for (suffix in roots) {
  cols = grep(paste0(suffix, "$"), names(data))
  data[cols] = scale(unlist(data[cols]))
}
#   apple.fruit    dog.pet pear.fruit   cat.pet
# 1  -0.9130535 -1.0811250  0.9126883 0.7207500
# 2  -0.9128709 -0.9009375  0.9128709 0.9009375
# 3  -0.9126883 -0.7207500  0.9130535 1.0811250

跨多列扩展

2 个答案: