I have a data frame similar to the one below.
Group Expenditure Date
A 56434 22 June 2014
B 54231 1 July 2013
B 1412 9 May 2011
A NA 28 July 2009
A NA 3 July 2009
C 98 2 July 1999
C NA 14 July 2004
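I am interested in creating a missing-value report for the Expenditure column. One figure should give the number of missing values in each column, which can be obtained with the following code: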
sapply(exp.dta, function(x) sum(is.na(x)))
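In addition, I would like to report the number of missing values for each date. The Date column has been converted to a proper date with as.Date. For now, I am not interested in reporting missing values per subgroup.
Answer 0 (score: 3)
Try this: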
library(plyr)
ddply(your.data, .(Date), summarize, nNA = sum(is.na(Expenditure)))
This splits the data into subgroups by Date and applies sum(is.na()) to the Expenditure column of each subgroup.
For example,
df <- read.table(text="Group Expenditure Date
A 56434 22June2014
B 54231 1July2013
B 1412 9May2011
A NA 28July2009
A NA 3July2009
C 98 2July1999
C NA 14July2004 ", sep="", header=T)
ddply(df, .(Date), summarize, nNA=sum(is.na(Expenditure)))
yields:
Date nNA
1 14July2004 1
2 1July2013 0
3 22June2014 0
4 28July2009 1
5 2July1999 0
6 3July2009 1
7 9May2011 0
There are also several base R solutions. Here are a few examples:
Using by:
by(df, df$Date, function(x) sum(is.na(x$Expenditure)))
Using tapply:
with(df, tapply(Expenditure, Date, function(x) sum(is.na(x))))
Using aggregate (hat tip to @user20650):
aggregate(df$Expenditure, by=list(df$Date), FUN= function(x) sum(is.na(x)))
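These all give the same result, in slightly different formats. Pick whichever you like best. For a more general treatment, this kind of problem is known as "split-apply-combine"; see, for example, here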
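To make the "split-apply-combine" idea concrete, here is a minimal base R sketch that performs the three steps one at a time. The data frame is retyped from the question, and the names pieces, counts and report are only illustrative.

```r
# Example data retyped from the question (dates kept as plain strings)
df <- data.frame(
  Group = c("A", "B", "B", "A", "A", "C", "C"),
  Expenditure = c(56434, 54231, 1412, NA, NA, 98, NA),
  Date = c("22June2014", "1July2013", "9May2011", "28July2009",
           "3July2009", "2July1999", "14July2004"),
  stringsAsFactors = FALSE
)

# Split: one vector of Expenditure values per date
pieces <- split(df$Expenditure, df$Date)

# Apply: count the missing values within each piece
counts <- vapply(pieces, function(x) sum(is.na(x)), integer(1))

# Combine: assemble the per-date counts into a single data frame
report <- data.frame(Date = names(counts), nNA = unname(counts))
report
```

The by, tapply, aggregate and ddply calls above each bundle these three steps into a single call.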
Answer 1 (score: 2)
Building on the code you have already written, you can add split to it:
dat <- read.table(h=T, text = "Group Expenditure Date
A 56434 22-June-2014
B 54231 1-July-2013
B 1412 9-May-2011
A NA 28-July-2009
A NA 3-July-2009
C 98 2-July-1999
C NA 14-July-2004")
> sapply(split(dat$Expenditure, dat$Group), function(x) sum(is.na(x)))
# A B C
# 2 0 1
Or per date:
> s <- split(dat$Expenditure, dat$Date)
> as.matrix(sapply(s, function(x) sum(is.na(x))))
# [,1]
# 14-July-2004 1
# 1-July-2013 0
# 22-June-2014 0
# 28-July-2009 1
# 2-July-1999 0
# 3-July-2009 1
# 9-May-2011 0
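Because the dates here are plain strings, the groups come out in alphabetical rather than chronological order. Since the question mentions as.Date, the following is a sketch of one way to get date-ordered groups. It assumes the day-month-year layout shown above and an English locale, so that %B matches month names such as "June"; the name na_by_date is illustrative.

```r
# Example data retyped from the answer above
dat <- data.frame(
  Expenditure = c(56434, 54231, 1412, NA, NA, 98, NA),
  Date = c("22-June-2014", "1-July-2013", "9-May-2011", "28-July-2009",
           "3-July-2009", "2-July-1999", "14-July-2004"),
  stringsAsFactors = FALSE
)

# Parse the strings into Date objects.
# Assumption: English locale, otherwise %B will not match "June", "July", "May".
dat$Date <- as.Date(dat$Date, format = "%d-%B-%Y")

# Same split/sapply idea as above; groups are now ordered by actual date
s <- split(dat$Expenditure, dat$Date)
na_by_date <- sapply(s, function(x) sum(is.na(x)))
na_by_date
```

If as.Date returns NA for every row, the locale or the format string most likely does not match the data.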
Answer 2 (score: 1)
Alternatively, using dplyr:
library('dplyr')
summarize(group_by(df, Date), nNA = sum(is.na(Expenditure)))