Create a subset of a data table in R based on conditions in other columns

Time: 2018-04-05 06:05:52

Tags: r dplyr data.table plyr

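I want to create a subset of the candyData below in R: group the data by Brand and, for each unique brand, find and print the maximum value for candies A and B. To illustrate, in the new data the Brand value Nestle should appear twice, once with Candy A and once with Candy B, each with its maximum value in the third column, and likewise for every other brand. Thanks, please help.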

candyData <- read.table(
text = "
Brand       Candy           value
Nestle      A               12
Nestle      B               34
Nestle      A               32
Hershey's   A               55
Hershey's   B               14
Hershey's   B               19
Mars        B               24
Nestle      B               26
Nestle      A               28
Hershey's   B               23
Hershey's   B               23
Hershey's   A               65
Mars        A               23
Mars        B               34",
header = TRUE,
stringsAsFactors = FALSE)

5 Answers:

Answer 0: (score: 2)

Try this:

library(dplyr)
candyData %>% 
  group_by(Brand, Candy) %>% 
  summarise(max=max(value))

The output will be:

# A tibble: 6 x 3
# Groups:   Brand [?]
  Brand     Candy   max
  <chr>     <chr> <dbl>
1 Hershey's A       65.
2 Hershey's B       23.
3 Mars      A       23.
4 Mars      B       34.
5 Nestle    A       32.
6 Nestle    B       34.

Answer 1: (score: 2)

aggregate(value ~ ., candyData, max)

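This groups candyData by Brand and Candy (they are all the columns other than value, which is what the . on the right-hand side of the formula stands for) and gives the max of value for each group.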
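As a minimal sketch of what the . in the formula above expands to, the two aggregate() calls below should give the same result. The data frame name candy and the variables via_dot and via_names are made up for this illustration; the values are the sample data from the question, rebuilt with data.frame() so the snippet runs on its own.

```r
# Sample data from the question, rebuilt with data.frame() so the sketch is
# self-contained. The name `candy` is hypothetical, chosen for this example.
candy <- data.frame(
  Brand = c("Nestle", "Nestle", "Nestle", "Hershey's", "Hershey's",
            "Hershey's", "Mars", "Nestle", "Nestle", "Hershey's",
            "Hershey's", "Hershey's", "Mars", "Mars"),
  Candy = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "B", "B", "A",
            "A", "B"),
  value = c(12, 34, 32, 55, 14, 19, 24, 26, 28, 23, 23, 65, 23, 34),
  stringsAsFactors = FALSE
)

# `.` means "every column not already used in the formula"
via_dot <- aggregate(value ~ ., candy, max)

# The same grouping with the columns spelled out
via_names <- aggregate(value ~ Brand + Candy, candy, max)

# Both forms group by Brand and Candy
print(all.equal(via_dot, via_names))
print(via_names)
```

Spelling the grouping columns out is a little longer, but it keeps working unchanged if more columns are later added to the data.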

Answer 2: (score: 1)

A few more solutions:

cd <- read.table(
    text = "
    Brand       Candy           value
    Nestle      A               12
    Nestle      B               34
    Nestle      A               32
    Hershey's   A               55
    Hershey's   B               14
    Hershey's   B               19
    Mars        B               24
    Nestle      B               26
    Nestle      A               28
    Hershey's   B               23
    Hershey's   B               23
    Hershey's   A               65
    Mars        A               23
    Mars        B               34",
    header = TRUE,
    stringsAsFactors = FALSE)

#using split + lapply or equivalently, by
c(by(cd$value, paste(cd$Brand, cd$Candy), max))

#using tapply i.e. apply to each group
tapply(cd$value, paste(cd$Brand, cd$Candy), max)

#using data.table
library(data.table)
setDT(cd)[, .(Max=max(value)), by=.(Brand, Candy)]

#using sqldf
library(sqldf)
sqldf("select Brand, Candy, max(value) as Max from cd group by Brand, Candy")
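The by and tapply calls above paste Brand and Candy into a single key, so the result is a named vector. As a hedged variation (not part of the original answer), tapply also accepts a list of grouping vectors, which keeps the two keys separate and returns a Brand by Candy matrix. The names candy, max_matrix and max_long are made up for this sketch; the values are the question's sample data.

```r
# Sample data from the question, rebuilt with data.frame() so the sketch is
# self-contained. The name `candy` is hypothetical, chosen for this example.
candy <- data.frame(
  Brand = c("Nestle", "Nestle", "Nestle", "Hershey's", "Hershey's",
            "Hershey's", "Mars", "Nestle", "Nestle", "Hershey's",
            "Hershey's", "Hershey's", "Mars", "Mars"),
  Candy = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "B", "B", "A",
            "A", "B"),
  value = c(12, 34, 32, 55, 14, 19, 24, 26, 28, 23, 23, 65, 23, 34),
  stringsAsFactors = FALSE
)

# A list of two grouping vectors gives one row per Brand, one column per Candy
max_matrix <- tapply(candy$value,
                     list(Brand = candy$Brand, Candy = candy$Candy),
                     max)
print(max_matrix)

# Reshape the matrix into long form: one row per Brand/Candy pair
max_long <- as.data.frame(as.table(max_matrix), responseName = "Max",
                          stringsAsFactors = FALSE)
print(max_long)
```

The matrix form is convenient for lookups such as max_matrix["Nestle", "A"], while the long form matches the three-column layout asked for in the question.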

Answer 3: (score: 0)

My answer is far less elegant than the dplyr one, but here is a solution that uses only base R.

# Split the data into one data frame per brand
splittedData <- split(candyData, candyData$Brand)
# Result frame starts as a single all-NA row; the first insert overwrites it
resultDf <- data.frame(matrix(ncol = 3))
colnames(resultDf) <- c("Brand", "Candy", "maxValue")
insertIndex <- 1
for (dfIndex in seq_along(splittedData)) {
  tempDf <- splittedData[[dfIndex]]
  # One row per distinct candy type within this brand
  tableDf <- data.frame(table(tempDf$Candy))
  tableDf[, 1] <- as.character(tableDf[, 1])
  for (i in seq_len(nrow(tableDf))) {
    resultDf[insertIndex, 1] <- tempDf$Brand[1]
    resultDf[insertIndex, 2] <- tableDf[i, 1]
    # Maximum value for this brand/candy combination
    resultDf[insertIndex, 3] <- max(tempDf$value[tempDf$Candy == tableDf[i, 1]])
    insertIndex <- insertIndex + 1
  }
}

The output is a new data frame:

  Brand     Candy maxValue
1 Hershey's     A       65
2 Hershey's     B       23
3      Mars     A       23
4      Mars     B       34
5    Nestle     A       32
6    Nestle     B       34

Answer 4: (score: 0)

Using the provided sample data and data.table:

library(data.table)
setDT(candyData)
candyData[,.(Max = max(value)), keyby = .(Brand,Candy)]

gives

       Brand Candy Max
1: Hershey's     A  65
2: Hershey's     B  23
3:      Mars     A  23
4:      Mars     B  34
5:    Nestle     A  32
6:    Nestle     B  34