Identifying rows in a data.frame that contain only NA values in R

Asked: 2014-09-01 05:07:36

Tags: r missing-data

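I have a data.frame of observations on 34 variables that hold ordinal values and NAs. I am clustering the data for a market segmentation study and need to remove the rows that contain only NAs. After taking out the userID column, I got an error message telling me to omit the 2099 rows containing only NAs before clustering.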

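I found a link that shows how to remove rows in which every value is NA, but I need to identify which 2099 rows have all NA values. The linked discussion, which removes the all-NA rows, is: Remove Rows with NAs in data.frame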

Here is a sample of the first five observations of six variables:

> head(Store2df, n=5)
  RowNo      Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1     1     <NA>   Male            <NA>          <NA>               <NA>
2     2    45-54 Female            <NA>          <NA>               <NA>
3     3     <NA>   <NA>            <NA>          <NA>               <NA>
4     4     <NA>   <NA>            <NA>          <NA>               <NA>
5     5    45-54 Female        75k-100k       Married                Yes
#Making a vector
> Vector1 <- Store2df$RowNo 
#Taking out RowNo column
> Store2df$RowNo <- NULL

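Edit: I stored the result in an object, but found that the code created an extra column. Clicking the object in RStudio's Environment pane shows an extra column called row.names that labels each row with its original row name. Several thousand rows were removed, and the new column labels the remaining rows with their old row numbers. But when I look at the head of the new object, I do not see the row labels. Why does the row.names label show up in the Environment pane but not when I view the head?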

#Remove all rows with only NA values
> Store2df <- Store2[!!rowSums(!is.na(Store2)),]
#View head of store2df
> head(Store2df)
    Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1  <NA>   Male            <NA>          <NA>               <NA>
2 45-54 Female            <NA>          <NA>               <NA>
5 45-54 Female        75k-100k       Married                Yes
6 25-34   Male        75k-100k       Married                 No
7 35-44 Female       125k-150k       Married                Yes
8 55-64   Male        75k-100k       Married                 No

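Edit 2: I added the row number / userID column to keep track of the users. To remove the all-NA rows, I took out that first column. Now I need to track which users I removed. I have a list of more than 2000 rows that contain only NA values, and I don't want to build the index for each row by hand.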

Question: how do I remove the emails that correspond to the rows with missing data?

> #First six rows of the column RowNo
> head(Store2df$RowNo)
[1] 1 2 3 4 5 6

I want to delete the 2099 rows from the data.frame Store2df, which includes RowNo. Here is a script that identifies which rows are entirely NA in the data.frame Store2df without RowNo.

> which(rowSums(is.na(Store2df))==ncol(Store2df))

The first six rows, with rows 3 and 4 removed:

> head(Store2df$RowNo)
[1] 1 2 5 6 7 8

I want to complete 4 steps:

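1) Take the RowNo column out of the data.frame Store2df and save it as a separate vector

2) Remove the rows of the Store2df data.frame in which all values are NA

3) Remove the same rows from the Store2new1 vector as were removed from the Store2df data.frame

4) Combine the vector and the data.frame, with the vector matching the data.frame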

2 answers:

Answer 0: (score: 12)

 which(rowSums(is.na(Store2))==ncol(Store2))
 #3 4 
 #3 4 

Or

 which(Reduce(`&`,as.data.frame(is.na(Store2))))
 #[1] 3 4

Or

 which(!rowSums(!is.na(Store2)))  
 #3 4 
 #3 4 

Data

 Store2 <- structure(list(Age = c(NA, "45-54", NA, NA, "45-54"), Gender = c("Male", 
 "Female", NA, NA, "Female"), HouseholdIncome = c(NA, NA, NA, 
  NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"), 
PresenceofChildren = c(NA, NA, NA, NA, "Yes"), HomeOwnerStatus = c(NA, 
NA, NA, NA, "Own"), HomeMarketValue = c(NA, NA, NA, NA, "150k-200k"
)), .Names = c("Age", "Gender", "HouseholdIncome", "MaritalStatus", 
"PresenceofChildren", "HomeOwnerStatus", "HomeMarketValue"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Update

To remove the rows in which all values are NA:

  Store2[!!rowSums(!is.na(Store2)),]
  #   Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
  #1  <NA>   Male            <NA>          <NA>               <NA>            <NA>
  #2 45-54 Female            <NA>          <NA>               <NA>            <NA>
  #5 45-54 Female        75k-100k       Married                Yes             Own
   #HomeMarketValue
  #1            <NA>
  #2            <NA>
  #5       150k-200k
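  • is.na(Store2) gives a logical index of the missing (NA) elements
  • ! negates the logical index, i.e. TRUE becomes FALSE and vice versa
  • rowSums of the above gives the number of non-NA elements in each row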

        rowSums(!is.na(Store2))
        #   1 2 3 4 5 
        #   1 2 0 0 7  # 3rd and 4th row have `0 non NA` values
    
  • ! negates the above, so TRUE marks the rows with no non-NA values

        !rowSums(!is.na(Store2))
        # 1     2     3     4     5 
        #FALSE FALSE  TRUE  TRUE FALSE 
    
  • We want to drop the rows that are all NAs, i.e. the rows with 0 non-NA values, so we apply ! again

        !!rowSums(!is.na(Store2))
        #1     2     3     4     5 
        #TRUE  TRUE FALSE FALSE  TRUE 
    
  • Subset the data.frame using the above logical index
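
The chain above can be checked step by step on a small data frame. The following is only a minimal sketch: the data frame `toy`, its columns, and the names `non_na_count`, `keep` and `toy_clean` are made up for illustration and are not part of the question's data.

```r
# Hypothetical toy data: rows 2 and 3 contain only NA values
toy <- data.frame(
  Age    = c("45-54", NA, NA, "25-34"),
  Gender = c("Female", NA, NA, NA),
  stringsAsFactors = FALSE
)

non_na_count <- rowSums(!is.na(toy))  # number of non-NA cells per row
keep <- non_na_count > 0              # explicit form of !!rowSums(!is.na(toy))

# The double negation and the explicit comparison select the same rows
stopifnot(identical(unname(keep), unname(!!non_na_count)))

toy_clean <- toy[keep, ]              # rows 1 and 4 remain
print(toy_clean)
```

Writing the comparison as `> 0` is equivalent to the double negation here and may be easier to read.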

UPDATE2

If you have two RowNo vectors, i.e. one stored separately before removing the NA rows and a second one stored after removing them:

   RowNo1 <- 1:6
   RowNo2 <- c(1,2,5,6)
   RowNo1 %in% RowNo2
   #[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
   RowNo1[RowNo1 %in% RowNo2]
   #[1] 1 2 5 6
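
To list the row numbers that were dropped rather than the ones that were kept, the same two vectors can be compared with `setdiff()` or a negated `%in%`. A minimal sketch reusing the illustrative `RowNo1` and `RowNo2` from above; `removed` and `removed_alt` are hypothetical names:

```r
RowNo1 <- 1:6               # row numbers stored before removing the all-NA rows
RowNo2 <- c(1, 2, 5, 6)     # row numbers that remain afterwards

removed     <- setdiff(RowNo1, RowNo2)       # values of RowNo1 absent from RowNo2
removed_alt <- RowNo1[!(RowNo1 %in% RowNo2)] # same values via logical indexing
print(removed)
```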

UPDATE3

Based on your new request, let me try again:

    Store2 <- structure(list(RowNo = 1:5, Age = c(NA, "45-54", NA, NA, "45-54"
    ), Gender = c("Male", "Female", NA, NA, "Female"), HouseholdIncome = c(NA, 
    NA, NA, NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"
   ), PresenceofChildren = c(NA, NA, NA, NA, "Yes")), .Names = c("RowNo", 
   "Age", "Gender", "HouseholdIncome", "MaritalStatus", "PresenceofChildren"
   ), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
   ))

Step 1

Save RowNo as a separate vector (I am not sure why this is needed)

  Store2new1 <- Store2$RowNo

Step 2

Remove the rows of the Store2 data.frame in which all values (other than RowNo) are NA, and store the result as Store2df

   Store2df <- Store2[!!rowSums(!is.na(Store2[,-1])),] #Here you already get the new dataset with `RowNo` column

   Store2df
   #RowNo   Age Gender HouseholdIncome MaritalStatus PresenceofChildren
   #1     1  <NA>   Male            <NA>          <NA>               <NA>
   #2     2 45-54 Female            <NA>          <NA>               <NA>
   #5     5 45-54 Female        75k-100k       Married                Yes

Step 3

Remove from the Store2new1 vector the same rows that were removed from the Store2df data.frame

   Store2new2 <- Store2new1[Store2new1 %in% Store2df$RowNo]
   Store2new1[Store2new1 %in% Store2df$RowNo]
   #[1] 1 2 5

Step 4

I really don't think step 4 or step 3 is needed unless you want to remove further rows, which is not clear from the post.
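
If the aim is to keep track of which users were dropped, one option is to compute the all-NA flag once and use it for both the data frame and the ID column. This is only a sketch under that assumption; `all_na`, `removed_ids`, and the cut-down `Store2` below (fewer columns than the sample data above) are illustrative.

```r
# Cut-down version of the sample data, with RowNo as the first column
Store2 <- data.frame(
  RowNo  = 1:5,
  Age    = c(NA, "45-54", NA, NA, "45-54"),
  Gender = c("Male", "Female", NA, NA, "Female"),
  stringsAsFactors = FALSE
)

# TRUE for rows where every column except RowNo is NA
all_na <- rowSums(!is.na(Store2[, -1])) == 0

removed_ids <- Store2$RowNo[all_na]   # IDs of the rows that will be dropped
Store2df    <- Store2[!all_na, ]      # cleaned data, RowNo still attached

print(removed_ids)
print(Store2df)
```

Because RowNo stays in the data frame throughout, no separate vector has to be kept in sync with it.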

Answer 1: (score: 4)

Using the Store2 sample data posted in the answer by @akrun:

which(apply(Store2, 1, function(x) all(is.na(x))))
#3 4 
#3 4 

Or, similar to akrun's answer:

which(rowSums(!is.na(Store2))==0)
#3 4 
#3 4
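
One difference between the two forms is that `apply()` first converts the data frame to a matrix and then calls the function once per row, whereas `rowSums()` is vectorised. Both should flag the same rows. A small check on made-up data (`df`, `by_apply` and `by_rowsums` are hypothetical names):

```r
# Hypothetical data: row 2 is entirely NA
df <- data.frame(
  x = c("a", NA, "c"),
  y = c("d", NA, NA),
  stringsAsFactors = FALSE
)

by_apply   <- which(apply(df, 1, function(x) all(is.na(x))))
by_rowsums <- which(rowSums(!is.na(df)) == 0)

# Compare positions only; the names attached to the two results may differ
stopifnot(identical(unname(by_apply), unname(by_rowsums)))
print(unname(by_rowsums))
```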