R: loop through a data.frame and get counts of variables

Time: 2016-02-25 19:27:05

Tags: r loops

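I have a data.frame with two columns, a unique identifier and a result. I need to go through the data.frame and count how many times each identifier occurs, as well as how many times each result occurs for it. The result column can take three possible values: Positive, Negative, or Ambiguous. So, for example, if there are 10 "RVP PCR" identifiers, I need to create a row with four columns, "Count", "Positive", "Negative", and "Ambiguous", holding the number of times each occurred. In the example with 10 "RVP PCR" identifiers, the output row should show the identifier, then a Count of 10, 7 Negative, 1 Positive, and 2 Ambiguous. How would you accomplish this in R?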

> str(foo)
'data.frame':   51 obs. of  2 variables:
 $ identifier: Factor w/ 99 levels "ADENOPCR","ALB-BF",..: 51 51 56 56 57 57 57 57 18 18 ...
 $ result    : Factor w/ 3 levels "Ambiguous","Negative",..: 2 1 2 1 2 1 2 1 2 1 ...



> dput(foo)
    structure(list(identifier = structure(c(80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 64L, 
18L, 18L, 76L, 76L, 76L, 70L, 70L, 70L, 70L, 71L, 64L, 77L, 77L, 
77L, 77L, 77L, 77L, 77L, 77L, 76L), .Label = c("ADENOPCR", "ALB-BF", 
"ASPERAG", "ASPERAGB", "BDGLUCAN", "BLASTO", "BORD PCR", "BPERT", 
"CMV QNT", "CMVPCR", "COCCI", "COCCI G/M", "COCCI PAN", "COCCI-PPT", 
"CPNEUMOPCR", "CRP", "CRY BLD", "CWP-KOH", "DIFF CONF", "EBV PAN", 
"EBV PAN 2", "EBV QNT", "EXCEPT", "EXCEPT TT", "FLUFAC", "FUNG PKG", 
"FUNGSEQ", "GLU-FL", "HERP I", "HHV6PCR", "HISTO", "HISTO PPT", 
"HISTOAG S", "HISTOGM U", "HMPVFA", "HMPVPCR", "HSVPCR", "LEGAG-U", 
"LEGIONFA", "LEGIONPCR", "MA AFB", "MA FUNGAL", "MA MIC", "MA MTBPRIM", 
"MC AFB", "MC AFBID", "MC AFBR", "MC BAL", "MC BLD", "MC CYST", 
"MC FUNG", "MC FUNGID", "MC Legion", "MC LEGION", "MC MTD", "MC NOC", 
"MC RESP", "MC STAPH", "MC Strep", "MC STREP", "MC VRE", "MC W", 
"MICROSEQ", "MPNEUMOPCR", "MS CWP", "MTBRIF PCR", "MYCO-M", "NG REPORT", 
"ORGSEQ", "PARAFLUPCR", "PCP PCR", "PNEUMO AB", "PNEUMST", "PNEUMST R", 
"RESPMINI", "RESPMINI ", "RSPFA", "RSPFAC", "RSV", "RVP PCR", 
"RVPPCR", "SPN AG", "TP-FL", "V CMVC", "V FLUC", "V HSVC", "V HSVCT", 
"V RESPC", "V Urea", "V VIC", "V VIC R", "V VIRAL", "V VIRAL N", 
"V VIRAL R", "V VZV", "VDRL CSF", "VZVFAC", "VZVPCR", "WNILE PCR"
), class = "factor"), result = structure(c(2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 
2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Ambiguous", 
"Negative", "Positive"), class = "factor")), .Names = c("identifier", 
"result"), row.names = 1500:1550, class = "data.frame")

5 answers:

Answer 0: (score: 2)

I'm not entirely sure what your expected output is, but you could reshape the data:

library(reshape2)

dcast(foo, identifier~result, fun.aggregate= length)

This produces:

  identifier Negative Positive
1    CWP-KOH        2        0
2 MPNEUMOPCR        0        2
3 PARAFLUPCR        3        1
4    PCP PCR        0        1
5  RESPMINI         4        0
6      RSPFA        7        1
7    RVP PCR       28        2

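######## Edited to add #############

With the data you provided, "RVP PCR" cannot produce the result you describe: the sample contains 30 "RVP PCR" rows, 28 Negative and 2 Positive, and no Ambiguous ones.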

Answer 1: (score: 2)

library(dplyr)
library(tidyr)
foo %>%
  group_by(identifier, result) %>%
  summarise(n = n()) %>%
  spread(key = result, value = n, drop = FALSE, fill = 0) %>%
  mutate(Total = Ambiguous + Negative + Positive) %>%
  filter(Total > 0)

Result:

Source: local data frame [7 x 5]
Groups: identifier [7]

  identifier Ambiguous Negative Positive Total
      (fctr)     (dbl)    (dbl)    (dbl) (dbl)
1    CWP-KOH         0        2        0     2
2 MPNEUMOPCR         0        0        2     2
3 PARAFLUPCR         0        3        1     4
4    PCP PCR         0        0        1     1
5  RESPMINI          0        4        0     4
6      RSPFA         0        7        1     8
7    RVP PCR         0       28        2    30

Answer 2: (score: 1)

Without any additional packages, you can do this:

xtabs(~ identifier + result, data=droplevels(foo))

This gives the following result:

> xtabs(~ identifier + result, data=droplevels(foo))
            result
identifier   Negative Positive
  CWP-KOH           2        0
  MPNEUMOPCR        0        2
  PARAFLUPCR        3        1
  PCP PCR           0        1
  RESPMINI          4        0
  RSPFA             7        1
  RVP PCR          28        2

If you need a data frame:

as.data.frame(unclass(xtabs(~ identifier + result, data=droplevels(foo))))

If you want the result in long format, you can also do:

foo$count <- 1
aggregate(count ~ identifier+result, data=foo, FUN=length)
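
The question asks for a Count column alongside all three result columns, including Ambiguous when it never occurs. A minimal base-R sketch of one way to get that shape follows. The small `foo` built here is a made-up stand-in for the asker's data, and the name `counts` is only illustrative. The idea is to drop unused levels from `identifier` only, so that every level of `result` still gets a column.

```r
# Toy stand-in for `foo` (hypothetical values, not the asker's data).
# Both columns are factors with unused levels, as in the question.
foo <- data.frame(
  identifier = factor(c("RVP PCR", "RVP PCR", "RVP PCR", "RSPFA", "RSPFA"),
                      levels = c("ADENOPCR", "RSPFA", "RVP PCR")),
  result = factor(c("Negative", "Positive", "Negative", "Negative", "Negative"),
                  levels = c("Ambiguous", "Negative", "Positive"))
)

# Cross-tabulate. droplevels() is applied to identifier only, so unused
# identifiers disappear while the unused "Ambiguous" result level is kept.
tab <- table(identifier = droplevels(foo$identifier), result = foo$result)

# Turn the table into a data frame and prepend the identifier and row total.
counts <- data.frame(
  identifier = rownames(tab),
  Count = rowSums(tab),
  as.data.frame.matrix(tab),
  row.names = NULL
)
counts
```

With this toy input, the "RVP PCR" row has a Count of 3 with 0 Ambiguous, 2 Negative, and 1 Positive, and the unused "ADENOPCR" level produces no row.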

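Answer 3: (score: 1)

The data is in long format. First reshape it to wide format with the dcast function from the reshape2 package, then add a Count column holding the sum of each row.

library(reshape2)
# Count the rows for each identifier/result pair
widedata <- dcast(foo, identifier ~ result, fun.aggregate = length)
# Sum every result column. With the full data these are columns 2:4 (Ambiguous,
# Negative, Positive); the sample has no Ambiguous rows, so widedata[, 2:4] would
# fail there with "undefined columns selected". Dropping column 1 works in both cases.
widedata$Count <- rowSums(widedata[, -1, drop = FALSE])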

Answer 4: (score: 0)

library(tidyr)
library(dplyr)

foo %>%
  count(identifier, result) %>%
  spread(result, n) # or spread(result, n, fill = 0, drop = FALSE)

#   identifier Negative Positive
#       (fctr)    (int)    (int)
# 1    CWP-KOH        2       NA
# 2 MPNEUMOPCR       NA        2
# 3 PARAFLUPCR        3        1
# 4    PCP PCR       NA        1
# 5  RESPMINI         4       NA
# 6      RSPFA        7        1
# 7    RVP PCR       28        2
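
In current tidyr, `spread()` is superseded by `pivot_wider()`. The sketch below shows how the same count-then-widen idea might look with it. It assumes a reasonably recent tidyr, since the `names_expand` argument is not available in older releases. The small `foo` is a made-up stand-in for the asker's data, and `wide` is only an illustrative name.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `foo` (hypothetical values, not the asker's data).
foo <- data.frame(
  identifier = factor(c("RVP PCR", "RVP PCR", "RVP PCR", "RSPFA", "RSPFA"),
                      levels = c("ADENOPCR", "RSPFA", "RVP PCR")),
  result = factor(c("Negative", "Positive", "Negative", "Negative", "Negative"),
                  levels = c("Ambiguous", "Negative", "Positive"))
)

wide <- foo %>%
  count(identifier, result) %>%              # one row per observed pair
  pivot_wider(names_from = result,
              values_from = n,
              values_fill = 0,               # 0 instead of NA for missing pairs
              names_expand = TRUE) %>%       # keep a column for unused result levels
  mutate(Count = Ambiguous + Negative + Positive)
wide
```

Here `names_expand = TRUE` is what keeps the Ambiguous column even though no row has that result. Without it, the Count step would fail on data like the sample above.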