Given the following data:
> data <- data.frame("randomData"=rnorm(5), "category"=c("A, B","A","C, A","B, C","B"))
randomData category
1 -0.4963843 A, B
2 1.6351726 A
3 -1.6209544 C, A
4 1.4167151 B, C
5 1.6380250 B
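My goal is to apply a function to the randomData column to compute some metric for each category (A, B, C) in the category column. At the moment the category column holds several categories per row, so grouping on it directly naturally produces this: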
> by(data[,1], data[,"category"], sum)
data[, "category"]: A
[1] 1.635173
-----------------------------------------------------------------------------------------------
data[, "category"]: A, B
[1] -0.4963843
-----------------------------------------------------------------------------------------------
data[, "category"]: B
[1] 1.638025
-----------------------------------------------------------------------------------------------
data[, "category"]: B, C
[1] 1.416715
-----------------------------------------------------------------------------------------------
data[, "category"]: C, A
[1] -1.620954
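I can get the unique category values (the new levels) like this. The as.character() call keeps it working whether category is stored as a factor or as a character column:
> levels <- levels(as.factor(unlist(strsplit(as.character(data[,"category"]),", "))))
And I can select the rows that belong to one of the new levels: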
> data[which(grepl(levels[1], data$category)), ]
randomData category
1 -0.4963843 A, B
2 1.6351726 A
3 -1.6209544 C, A
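As a next step I would build a loop that repeats this for each new level and finally computes the value (e.g. the sum) per category. But is there a better (loop-free) way to split the data by these categories and compute some metric for each group?
Thanks for any suggestions!
Answer 0 (score: 2)
You can also try this, which uses fairly simple syntax: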
library(splitstackshape)
# slightly 'simpler' randomData
# to make it easier to check if this gives the desired results
df <- data.frame(randomData = 1:5,
                 category = c("A, B", "A", "C, A", "B, C", "B"))
df
# randomData category
# 1 1 A, B
# 2 2 A
# 3 3 C, A
# 4 4 B, C
# 5 5 B
# split the concatenated column, and reshape from wide to long format
df2 <- concat.split.multiple(data = df, split.cols = "category", direction = "long")
df2
# calculate sum per category
aggregate(randomData ~ category, data = df2, sum)
# category randomData
# 1 A 6
# 2 B 10
# 3 C 7
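If adding a package is not an option, the same split-then-aggregate idea can be sketched in base R. This is only an illustration under that assumption, not how splitstackshape works internally; the names parts, long and result are made up for this example.

```r
# Same toy data as above
df <- data.frame(randomData = 1:5,
                 category = c("A, B", "A", "C, A", "B, C", "B"))

# Split each row's category string into its individual labels
parts <- strsplit(as.character(df$category), ", ", fixed = TRUE)

# Long format: repeat each randomData value once per label on its row
long <- data.frame(randomData = rep(df$randomData, lengths(parts)),
                   category = unlist(parts))

# Sum per individual category
result <- aggregate(randomData ~ category, data = long, FUN = sum)
result
```

Any other summary function (mean, length, ...) can be passed as FUN in place of sum.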
Answer 1 (score: 1)
apply and friends are usually a good way to avoid explicit loops at the top level.
Here is one approach:
# Generate the data
set.seed(100)
data <- data.frame("randomData"=rnorm(5),
                   "category"=c("A, B","A","C, A","B, C","B"))
# Grab the unique categories
# (as.character() makes this work whether category is a factor or a character column)
categories <- unique(unlist(strsplit(as.character(data$category), ", ")))
# Use sapply to process each category separately
sums <- sapply(categories,
               function(x){sum(data[grep(x, data$category), "randomData"])})
Result:
> data
randomData category
1 -0.50219235 A, B
2 0.13153117 A
3 -0.07891709 C, A
4 0.88678481 B, C
5 0.11697127 B
> sums
A B C
-0.4495783 0.5015637 0.8078677
Now you can measure the data (by category) any way you like: just replace the sum function in the third statement.
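One caveat: grep matches substrings, so a label such as "A" would also select rows tagged with a longer label that contains it (say, "AB"). With the single-letter labels in the question this makes no difference, but if labels can contain one another, an exact-match variant may be safer. A minimal sketch, where parts and has_label are names made up for this example:

```r
set.seed(100)
data <- data.frame(randomData = rnorm(5),
                   category = c("A, B", "A", "C, A", "B, C", "B"))

# Split each row's category string into its individual labels
parts <- strsplit(as.character(data$category), ", ", fixed = TRUE)
categories <- sort(unique(unlist(parts)))

# For each label, sum the rows whose label list contains it exactly
sums <- sapply(categories, function(label) {
  has_label <- vapply(parts, function(p) label %in% p, logical(1))
  sum(data$randomData[has_label])
})
sums
```

For this data the result should agree with the grep-based version, since no label is a substring of another.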