How do I subset a data frame by the top quartile of each column?

Time: 2018-01-09 02:11:01

Tags: r dataframe dplyr

#let's make some sample data first
names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric1_set1 <- c(2.5,3.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric1_set2 <- c(2.1,3.1,4.15,2.5,10,7.1,8.5,10,10.1,17.1,12.3,17.3,8,11,1.5) 
metric1_set3 <- c(12.1,13.1,4.15,2.5,10.5,7.1,2.5,10,7.1,11.1,12.3,17.3,8,1.45,1.5) 
dataset1 <- data.frame(names,metric1_set1,metric1_set2,metric1_set3)


names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric2_set1 <- c(21.5,13.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric2_set2 <- c(12.1,3.1,4.15,2.5,10,7.1,8.5,10,8.1,17.1,12.3,17.3,8,1.1,1.5) 
metric2_set3 <- c(2.1,13.1,4.15,2.5,10.5,7.1,21.5,10,7.1,11.1,12.3,12.3,8,1.45,1.5) 
dataset2 <- data.frame(names,metric2_set1,metric2_set2,metric2_set3)

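The task is to compute the top quartile of each column of dataset1 and then extract the rows with the corresponding names from dataset2. The idea is to get the correlation between these subsetted values.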

quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)

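This gives me the quartiles, but the real problem is how to keep the names associated with, say, the top values of one dataset, and drop every other row from both datasets.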
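For reference, here is a small sketch of what that `apply` call returns, using the sample data from above (rebuilt inline so the snippet runs on its own). With the default `probs`, `quantile` yields five levels, so the result is a matrix with one row per level. Picking the row by its name rather than by position is only a suggestion, and `cutoffs` is a name chosen here for illustration.

```r
# Sample data copied from the question (dataset1 only).
dataset1 <- data.frame(
  names = rep(c("t1", "t2", "t3", "t4", "t5"), times = 3),
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

# One column per metric, one row per default quantile level.
quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)
rownames(quantiles)   # "0%" "25%" "50%" "75%" "100%"

# Select the 75% row by name instead of by position
# (hypothetical variable name, not from the original post).
cutoffs <- quantiles["75%", ]
```

Indexing by `"75%"` avoids having to remember which position holds which level.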

Following @sconfluentus's suggestion, this can be changed to:

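 # Return the top-quartile cutoff of x.
 # With default probs, quantile() returns the 0%, 25%, 50%, 75% and 100% levels,
 # so the fourth element is the 75% quantile (the third would be the median).
 topQuartile <- function(x) {
   y <- quantile(x, na.rm = TRUE)
   z <- y[4]
   return(z)
 }
 quartile_dataset1 <- apply(dataset1[2:4], 2, topQuartile)

This works perfectly, but I also need something like the following:

 # Keep the values of one column that lie at or above its cutoff
 topquartile_set1 <- subset(dataset1$metric1_set1, subset = (dataset1$metric1_set1 >= quartile_dataset1[1]))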

I need similar code that works for every column and puts all the subsets into one final data frame.

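2 answers:

Answer 0: (score: 0)

The simplest way is to wrap quantile in a function that extracts the fifth quantile and returns it, and pass that function to apply, like this (note that with the default probs the fifth element is the 100% level, i.e. the maximum; use y[4] for the 75% cutoff):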

# Return the fifth default quantile level of x (100%, i.e. the maximum)
fifthQuantile <- function(x) {
  y <- quantile(x, na.rm = TRUE)
  z <- y[5]
  return(z)
}

quantiles <- apply(dataset1[2:4], 2, fifthQuantile)

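This returns a named numeric vector with one value per column, named after the old column names. If you would like the result shaped the other way, try:

quantiles <- t(apply(dataset1[2:4], 2, fifthQuantile))

This gives you a one-row matrix whose columns correspond to the columns of the original data frame.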
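The cutoffs alone do not yet give the subsets the question asks for. The following is a minimal sketch of one way to finish the job in base R, under two assumptions that the post does not confirm: the rows of dataset1 and dataset2 are aligned one to one, and the metric columns of the two data frames correspond by position. The names `cutoffs`, `metric_cols`, `pieces`, `top_quartile`, `value_dataset1` and `value_dataset2` are made up for illustration.

```r
# Sample data copied from the question.
nm <- rep(c("t1", "t2", "t3", "t4", "t5"), times = 3)
dataset1 <- data.frame(
  names = nm,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5),
  stringsAsFactors = FALSE
)
dataset2 <- data.frame(
  names = nm,
  metric2_set1 = c(21.5, 13.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric2_set2 = c(12.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 8.1, 17.1, 12.3, 17.3, 8, 1.1, 1.5),
  metric2_set3 = c(2.1, 13.1, 4.15, 2.5, 10.5, 7.1, 21.5, 10, 7.1, 11.1, 12.3, 12.3, 8, 1.45, 1.5),
  stringsAsFactors = FALSE
)

# 75% cutoff per metric column; names = FALSE keeps the plain column names.
cutoffs <- sapply(dataset1[2:4], quantile, probs = 0.75, na.rm = TRUE, names = FALSE)

metric_cols <- names(dataset1)[-1]

# For each metric column, keep the rows at or above its cutoff and pick up
# the value at the same row and column position in dataset2 (assumed aligned).
pieces <- lapply(seq_along(metric_cols), function(i) {
  col  <- metric_cols[i]
  keep <- which(dataset1[[col]] >= cutoffs[[col]])  # which() drops NAs
  data.frame(
    names          = dataset1$names[keep],
    column         = col,
    value_dataset1 = dataset1[[col]][keep],
    value_dataset2 = dataset2[[i + 1]][keep],
    stringsAsFactors = FALSE
  )
})

# Stack the per-column subsets into one final data frame.
top_quartile <- do.call(rbind, pieces)
```

From there, something like `cor(top_quartile$value_dataset1, top_quartile$value_dataset2)` would give the correlation between the subsetted values, if that pairing is what is wanted.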

Answer 1: (score: 0)

I would start by gathering the data into long format with the tidyr package:

library(tidyr)
df.gathered = gather(dataset1, key = "category", value = "value", -names)

Result:

names  category    value
--------------------------
 t1 metric1_set1  2.50
 t2 metric1_set1  3.10
 t3 metric1_set1  4.50
 t4 metric1_set1  2.50
 t5 metric1_set1 12.00
 t1 metric1_set1  7.10
 t2 metric1_set1  8.50
 t3 metric1_set1 10.00
 t4 metric1_set1 10.10
 t5 metric1_set1 17.80 
 ...  # and similar rows for metric1_set2 and metric1_set3 ...

Then you can use the group_by function from dplyr to get the highest quantile (probs = 1, i.e. the maximum) for each name and category:

library(dplyr)
df.gathered %>% group_by(names, category) %>% summarise(Q1 = quantile(value, 1))

names   category    Q1
----------------------------
  t1 metric1_set1  12.3
  t1 metric1_set2  12.3
  t1 metric1_set3  12.3
  t2 metric1_set1  11.0
  t2 metric1_set2  17.3
  t2 metric1_set3  17.3
  ...
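The summary above returns one maximum per name and category. If the goal is instead to keep the rows that fall in the top quartile of each column, a grouped `filter` is one possible variation. This is a sketch, not the answerer's code: it groups by `category` only, uses a 75% cutoff, and adds a helper column `row` (a name invented here) so the surviving rows can later be matched back to dataset2.

```r
library(tidyr)
library(dplyr)

# Sample data copied from the question (dataset1 only).
dataset1 <- data.frame(
  names = rep(c("t1", "t2", "t3", "t4", "t5"), times = 3),
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

top_rows <- dataset1 %>%
  mutate(row = row_number()) %>%                        # remember the original row
  gather(key = "category", value = "value", -names, -row) %>%
  group_by(category) %>%                                # one cutoff per column
  filter(value >= quantile(value, 0.75, na.rm = TRUE)) %>%
  ungroup()
```

Each category keeps only its own top-quartile rows, and `top_rows$row` can be used to index the matching rows of dataset2, assuming the two data frames are row-aligned.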