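I have a data frame with 160 columns and more than 30k rows. I want to scale the values in each column, but the catch is that every column belongs to one of three groups, and the scaling should be computed over all values in a group rather than column by column.
Here is an example: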
data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))
which produces the following data frame:
apple.fruit dog.pet pear.fruit cat.pet
1 1 10001 11
2 2 10002 12
3 3 10003 13
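I'm looking for a clever way to find all columns whose names contain "fruit" and scale all fruit values jointly across those columns (and do the same for "pet"), ending up with: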
apple.fruit dog.pet pear.fruit cat.pet
-0.91305 -1.08112 0.91268 0.72075
-0.91287 -0.90093 0.91287 0.90093
-0.91268 -0.72075 0.91305 1.08112
Put another way: instead of scaling apple.fruit like this:
scale(data$apple.fruit)
I want to scale it like this:
scale(c(data$apple.fruit, data$pear.fruit))[1:3]
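To spell out the arithmetic behind that last expression, here is a minimal base R sketch, assuming the example data above. With its default arguments, scale() subtracts the mean and divides by the sample standard deviation, so joint scaling amounts to using one pooled mean and one pooled standard deviation for every column in the group. The variable names (fruit_values, pooled_mean, pooled_sd) are illustrative only.

```r
data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)

# Pool every value from the two fruit columns into one vector.
fruit_values <- c(data$apple.fruit, data$pear.fruit)

# One mean and one sample standard deviation for the whole group.
pooled_mean <- mean(fruit_values)
pooled_sd <- sd(fruit_values)

# Apply the same centering and scaling to each fruit column.
apple_scaled <- (data$apple.fruit - pooled_mean) / pooled_sd
pear_scaled <- (data$pear.fruit - pooled_mean) / pooled_sd
```

For this data, apple_scaled should agree with scale(c(data$apple.fruit, data$pear.fruit))[1:3], and pear_scaled with elements 4 to 6 of the same result.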
Answer 0: (score: 1)
A tidy approach: convert the data to "long" tidy format, group by type (fruit, pet, and so on), then scale within each group.
library(tidyverse)
data <- data.frame(cbind(apple.fruit=1:3, dog.pet=1:3, pear.fruit=10001:10003, cat.pet=11:13))
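data.tidy <- data %>%
  # Reshape to long format: one row per (column name, value) pair
  gather(key = "id", value = "value") %>%
  # Split "apple.fruit" into type = "fruit" and name = "apple"
  mutate(type = gsub(".*\\.(.*$)", "\\1", id),
         name = gsub("(.*)\\..*$", "\\1", id)) %>%
  group_by(type) %>%
  # Scale over all values in the group; as.numeric() drops the
  # one-column matrix that scale() returns
  mutate(scaleit = as.numeric(scale(value)))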
data.tidy
#> # A tibble: 12 x 5
#> # Groups: type [2]
#> id value type name scaleit
#> <chr> <int> <chr> <chr> <dbl>
#> 1 apple.fruit 1 fruit apple -0.913
#> 2 apple.fruit 2 fruit apple -0.913
#> 3 apple.fruit 3 fruit apple -0.913
#> 4 dog.pet 1 pet dog -1.08
#> 5 dog.pet 2 pet dog -0.901
#> 6 dog.pet 3 pet dog -0.721
#> 7 pear.fruit 10001 fruit pear 0.913
#> 8 pear.fruit 10002 fruit pear 0.913
#> 9 pear.fruit 10003 fruit pear 0.913
#> 10 cat.pet 11 pet cat 0.721
#> 11 cat.pet 12 pet cat 0.901
#> 12 cat.pet 13 pet cat 1.08
Created on 2018-08-23 by the reprex package (v0.2.0.9000).
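Answer 1: (score: 0)
Convert the data to long format, then scale one column at a time. Here is an approach using data.table::melt, which conveniently lets you melt several sets of columns at once based on a naming pattern.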
library(data.table)
setDT(data)
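# Group names are whatever follows the last "." in each column name
roots = unique(sub(".*\\.", "", names(data)))
# Melt each group's columns into one long column (value1, value2, ...)
result = melt(data, measure.vars = patterns(roots))
setnames(result, old = paste0("value", seq_along(roots)), new = roots)
# Scale each long column in place (skipping the first column, "variable")
for (j in names(result)[-1]) set(result, j = j, value = scale(result[[j]]))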
result
# variable fruit pet
# 1: 1 -0.9130535 -1.0811250
# 2: 1 -0.9128709 -0.9009375
# 3: 1 -0.9126883 -0.7207500
# 4: 2 0.9126883 0.7207500
# 5: 2 0.9128709 0.9009375
# 6: 2 0.9130535 1.0811250
Otherwise, I think a for loop is pretty simple:
data = as.data.frame(data) # in case you converted to data.table above
roots = unique(sub(".*\\.", "", names(data)))
for (suffix in roots) {
  cols = grep(paste0(suffix, "$"), names(data))
  data[cols] = scale(unlist(data[cols]))
}
# apple.fruit dog.pet pear.fruit cat.pet
# 1 -0.9130535 -1.0811250 0.9126883 0.7207500
# 2 -0.9128709 -0.9009375 0.9128709 0.9009375
# 3 -0.9126883 -0.7207500 0.9130535 1.0811250
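With 160 columns it may be convenient to wrap the loop above in a function. The following is a minimal base R sketch, not part of the original answer: scale_by_suffix is a hypothetical helper name, and it assumes the group is always the text after the last "." in the column name. It matches the suffix exactly rather than with grep(), so a group such as "pet" would not accidentally pick up a column ending in, say, ".carpet". Passing na.rm = TRUE is also an assumption, chosen to mirror how scale() ignores missing values.

```r
# Hypothetical helper: scale columns jointly within groups defined by the
# text after the last "." in each column name.
scale_by_suffix <- function(df) {
  suffix <- sub(".*\\.", "", names(df))
  for (s in unique(suffix)) {
    # Exact match on the suffix, so "pet" does not match "carpet"
    cols <- which(suffix == s)
    # Pool all values of the group to get one mean and one sd
    pooled <- unlist(df[cols], use.names = FALSE)
    mu <- mean(pooled, na.rm = TRUE)
    sigma <- sd(pooled, na.rm = TRUE)
    # Apply the pooled centering and scaling to each column in the group
    df[cols] <- lapply(df[cols], function(x) (x - mu) / sigma)
  }
  df
}

data <- data.frame(apple.fruit = 1:3, dog.pet = 1:3,
                   pear.fruit = 10001:10003, cat.pet = 11:13)
scaled <- scale_by_suffix(data)
```

The result keeps the original wide layout and column order, and for the example data scaled$apple.fruit should equal scale(c(data$apple.fruit, data$pear.fruit))[1:3] as requested in the question.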