R: Split a data.frame by an attribute with multiple categorical values per row to compute a measure

Asked: 2013-11-24 04:21:09

Tags: r

Given the following data:

> data <- data.frame("randomData"=rnorm(5), "category"=c("A, B","A","C, A","B, C","B"))

  randomData category
1 -0.4963843     A, B
2  1.6351726        A
3 -1.6209544     C, A
4  1.4167151     B, C
5  1.6380250        B

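My goal is to apply a function to the randomData column to compute some measure for each category (A, B, C) in the category column. At the moment, a single cell in the category column can hold several categories, so grouping by that column treats every distinct combination as its own group: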

> by(data[,1], data[,"category"], sum)
data[, "category"]: A
[1] 1.635173
----------------------------------------------------------------------------------------------- 
data[, "category"]: A, B
[1] -0.4963843
----------------------------------------------------------------------------------------------- 
data[, "category"]: B
[1] 1.638025
----------------------------------------------------------------------------------------------- 
data[, "category"]: B, C
[1] 1.416715
----------------------------------------------------------------------------------------------- 
data[, "category"]: C, A
[1] -1.620954

Now, I can get the unique category values (the new levels) like this:

> levels <- levels(as.factor(unlist(strsplit(as.character(data[,"category"]), ", "))))

And I can select the rows that belong to a given new level:

> data[which(grepl(levels[1], data$category)), ]
  randomData category
1 -0.4963843     A, B
2  1.6351726        A
3 -1.6209544     C, A

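As a next step, I would build a loop that repeats this for every new level and finally computes a value (e.g. the sum) for each category. However, is there a better (loop-free) way to split the data by these categories and compute some measure for each group?

Thanks for your suggestions!

2 answers:

Answer 0: (score: 2)

You can also try this, which uses a fairly simple syntax: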

library(splitstackshape)

# slightly 'simpler' randomData
# to make it easier to check if this gives the desired results
df <- data.frame(randomData = 1:5,
                  category = c("A, B", "A", "C, A", "B, C", "B"))
df
#   randomData category
# 1          1     A, B
# 2          2        A
# 3          3     C, A
# 4          4     B, C
# 5          5        B    

# split the concatenated column, and reshape from wide to long format
df2 <- concat.split.multiple(data = df, split.cols = "category", direction = "long")
df2

# calculate sum per category
aggregate(randomData ~ category, data = df2, sum)
#   category randomData
# 1        A          6
# 2        B         10
# 3        C          7
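For readers who prefer not to add a package, the same split-then-aggregate idea can be sketched in base R. This is only a sketch, not what splitstackshape does internally: it assumes the labels are always separated by exactly ", " and that `lengths()` is available (R 3.2.0 or later). The names `parts`, `long`, and `res` are illustrative.

```r
# Same simplified data as above; keep category as character
df <- data.frame(randomData = 1:5,
                 category = c("A, B", "A", "C, A", "B, C", "B"),
                 stringsAsFactors = FALSE)

# Split each row's category string into its individual labels
parts <- strsplit(df$category, ", ", fixed = TRUE)

# Repeat each randomData value once per label: one row per (value, label) pair
long <- data.frame(randomData = rep(df$randomData, lengths(parts)),
                   category = unlist(parts))

# Aggregate per individual category
res <- aggregate(randomData ~ category, data = long, FUN = sum)
res
```

A row such as "A, B" contributes its value to both A and B, so the per-category totals add up to more than the column total.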

Answer 1: (score: 1)

apply and friends are usually a good way to avoid explicit loops at the top level.

Here is one approach:

# Generate the data 
set.seed(100)
data <- data.frame("randomData"=rnorm(5), 
                   "category"=c("A, B","A","C, A","B, C","B"))
# Grab the unique categories
categories <- unique(unlist(strsplit(as.character(data$category), ", ")))
# Use sapply to process each category separately
sums <- sapply(categories, 
               function(x){sum(data[grep(x, data$category), "randomData"])})

Result:

> data
   randomData category
1 -0.50219235     A, B
2  0.13153117        A
3 -0.07891709     C, A
4  0.88678481     B, C
5  0.11697127        B
> sums
         A          B          C 
-0.4495783  0.5015637  0.8078677 

You can now measure the data (per category) any way you like: just replace the sum function in the third step, the final sapply call.
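One caveat with `grep` is that it matches patterns, so a category named "A" would also match a row labelled "AB". The variant below is a hedged sketch that tests exact membership in the split labels instead and takes the measure as an argument. The helper `measure_by_category` and the variable `labels_per_row` are hypothetical names, and the measure is assumed to return a single number.

```r
set.seed(100)
data <- data.frame(randomData = rnorm(5),
                   category = c("A, B", "A", "C, A", "B, C", "B"),
                   stringsAsFactors = FALSE)

# Split once: a list with the individual labels of each row
labels_per_row <- strsplit(data$category, ", ", fixed = TRUE)
categories <- sort(unique(unlist(labels_per_row)))

# Hypothetical helper: apply measure f to the rows belonging to each category
measure_by_category <- function(f) {
  vapply(categories, function(ctg) {
    # TRUE for rows whose label set contains ctg exactly
    in_group <- vapply(labels_per_row, function(lbls) ctg %in% lbls, logical(1))
    f(data$randomData[in_group])
  }, numeric(1))
}

sums  <- measure_by_category(sum)
means <- measure_by_category(mean)
```

Swapping the measure is then a matter of passing a different function, for example `measure_by_category(median)`.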