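Extracting values from an R table within grouped values

Time: 2012-12-07 17:10:38

Tags: r

I have the following table, ordered by the groups first, second and Name.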

    myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53), 
    Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene", 
    "Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
    ), class = "factor"), Area = c(699468L, 153744L, 32913L, 
    4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L, 
    2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
    ), class = "factor")), .Names = c("first", "second", "Name", 
    "Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))

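Within each group, I would like to extract the Area that corresponds to a specific sample. Several groups have no Area from a given sample, so if a sample is not detected, "NA" should be returned. Ideally, the final output should have one column per sample.

I have tried the ifelse function to create one column per sample: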

PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")

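However, this does not take the group distribution into account. I want to do the same thing, but within groups: for each group (a group being the rows with equal values in the first, second and Name columns), return the Area if Sample is PO1:1, otherwise NA.

For the first group: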

structure(list(first = c(120L, 120L), second = c(1.33, 1.33), 
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"), 
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1", 
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name", 
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))

The output should be:

structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))

Any suggestions?

2 answers:

Answer 0 (score: 1)

I am assuming that Sample is a factor, as in the question's example. If it is not, consider converting it to one.
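A minimal sketch of that conversion, assuming Sample arrives as a plain character vector. The names `sample_col`, `all_samples` and `sample_fac` are illustrative only, and listing `PO3:1` as a level is an assumption based on the output the question asks for:

```r
# Hypothetical character column, as it might look before conversion
sample_col <- c("PO4:1", "PO2:1", "PO4:1", "PO4:1", "PO4:1", "PO1:1", "PO1:1")

# Declaring the levels explicitly lets a sample that never occurs (PO3:1 here)
# still be known to R as a level
all_samples <- c("PO1:1", "PO2:1", "PO3:1", "PO4:1")
sample_fac <- factor(sample_col, levels = all_samples)

print(levels(sample_fac))
print(table(sample_fac))  # the unused level shows up with a count of 0
```

With the levels declared this way, `levels()` already lists the missing sample, so no manual adjustment of the sample list should be needed later on.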

First, let's clean up the Sample column so that its levels are legal names; otherwise they may cause errors.

levels(myData$Sample)  <-  make.names(levels(myData$Sample))
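As a quick check of what this step does, the following sketch applies `make.names` to the three level strings from the question. The variable names are illustrative only:

```r
# Level strings as they appear in the question's data
sample_levels <- c("PO1:1", "PO2:1", "PO4:1")

# make.names() replaces characters that are not valid in R names with "."
clean_levels <- make.names(sample_levels)
print(clean_levels)
# [1] "PO1.1" "PO2.1" "PO4.1"
```
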


## DEFINE THE CUTS##

# Adjust these as necessary
#--------------------------
  max.second <- 3  #  max & min range of myData$second 
  min.second <- 0  #
  sprd <- 0.15     # with spread for each group
#--------------------------

# we will cut the myData$second according to intervals,   cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)

# Next, let's create a group column to split our data frame by 
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-') 
groups <- split(myData, myData$group)

samples <- levels(myData$Sample)
## Not every sample is present in the example data. The PO3.1 column in the result below requires adding it manually:
## samples <- sort(c(samples, "PO3.1"))


# Apply over each group, then apply over each sample    
myOutput <- 
  t(sapply(groups, function(g) {

      #-------------------------------
      # NOTE: If a group can contain more than one Area per Sample, we have to allow for this somehow. Hence the "paste(...)"
      res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - "))  # allowing for multiple values
      unlist(ifelse(res=="", NA, res))

      ## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
      # res <- sapply(samples, function(s) g$Area[g$Sample==s])  # <~~ This line will work when only one value per sample
      # unlist(ifelse(lengths(res)==0, NA, res))  # a sample with no rows in the group yields a zero-length vector
      #-------------------------------

  }))

# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-")  ## or whichever proper group name

# remove dummy column 
myData$group <- NULL

Result

myOutput

        PO1.1    PO2.1    PO3.1 PO4.1            
Group-1 NA       "153744" NA    "699468"         
Group-2 NA       NA       NA    "32913 - 4948619"
Group-3 NA       NA       NA    "83528"          
Group-4 "536339" NA       NA    NA               
Group-5 "105598" NA       NA    NA        
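
To see how the `cut()` step forms the groups, the binning can be run in isolation. This is a minimal sketch that reuses the answer's settings (breaks from 0 to 3 in steps of `sprd * 2`) on the `second` values from the question. The names `second_vals`, `bins` and `same_bin` are illustrative only:

```r
# 'second' values from the question's data, in row order
second_vals <- c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53)

# Same breaks as in the answer: 0 to 3 in steps of sprd * 2
sprd <- 0.15
intervals <- seq(0, 3, sprd * 2)

# cut() assigns each value to a half-open interval (lower, upper]
bins <- cut(second_vals, intervals)
print(data.frame(second = second_vals, bin = bins))

# Rows 3 and 4 (0.36 and 0.37) are compared here: values that share a bin get
# the same group key whenever 'first' and 'Name' also match
same_bin <- as.character(bins[3]) == as.character(bins[4])
print(same_bin)
```

Because 0.36 and 0.37 land in the same bin, both Heptene areas appear together in Group-2, whereas grouping on the exact `second` value would keep them apart. Note that with these defaults a value exactly equal to the lowest break would become `NA` unless `include.lowest = TRUE` is passed to `cut()`.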

Answer 1 (score: 1)

You really can't expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?

> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
  first second               Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1   120    1.3          Pentanone     699468     153744         NA
3   126    0.4            Heptene      32913         NA         NA
4   126    0.4            Heptene    4948619         NA         NA
5   126    0.3        Methylamine      83528         NA         NA
6   132    0.5            Benzene         NA         NA     536339
7   132    0.5 Ethene._trichloro-         NA         NA     105598
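
A self-contained sketch of the `reshape` call, assuming `inp` holds the question's data (it is rebuilt here in compact form with the same values). The final step, which adds an all-NA column for a sample that never occurs, is a hypothetical addition and not part of the answer. The names `expected` and `wide` are illustrative:

```r
# The question's data, rebuilt compactly (same values as the dput output)
inp <- data.frame(
  first  = c(120L, 120L, 126L, 126L, 126L, 132L, 132L),
  second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
  Name   = c("Pentanone", "Pentanone", "Heptene", "Heptene",
             "Methylamine", "Benzene", "Ethene._trichloro-"),
  Area   = c(699468L, 153744L, 32913L, 4948619L, 83528L, 536339L, 105598L),
  Sample = c("PO4:1", "PO2:1", "PO4:1", "PO4:1", "PO4:1", "PO1:1", "PO1:1"),
  stringsAsFactors = TRUE
)

# One row per (first, second, Name) combination, one Area column per Sample
wide <- reshape(inp, direction = "wide",
                idvar = c("first", "second", "Name"), timevar = "Sample")

# Hypothetical extra step: add a column of NAs for each expected sample that
# does not occur in the data, then put the Area columns in sample order
expected <- paste0("Area.PO", 1:4, ":1")
for (col in setdiff(expected, names(wide))) wide[[col]] <- NA_integer_
wide <- wide[, c("first", "second", "Name", expected)]
print(wide)
```

Unlike the binned approach, this groups on the exact `second` value, so the two Heptene rows (0.36 and 0.37) stay on separate lines.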