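I have a data.frame with 34 ordinal variables that contain NA values. I am performing clustering for a market segmentation study and need to remove the rows that contain only NAs. After taking out the userID, I got an error message telling me to omit 2099 rows containing only NAs before clustering.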
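I found a link about removing rows in which all values are NA, but I need to identify which 2099 rows have all NA values. The linked discussion removes rows with all NA values: Remove Rows with NAs in data.frame
Here is a sample of the first five observations of six variables: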
> head(Store2df, n=5)
RowNo Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1 1 <NA> Male <NA> <NA> <NA>
2 2 45-54 Female <NA> <NA> <NA>
3 3 <NA> <NA> <NA> <NA> <NA>
4 4 <NA> <NA> <NA> <NA> <NA>
5 5 45-54 Female 75k-100k Married Yes
#Making a vector
> Vector1 <- Store2df$RowNo
#Taking out RowNo column
> Store2df$RowNo <- NULL
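EDIT: I stored the result in an object, but found that the code created an extra column. When I click the object in RStudio's Environment pane, there is an extra column named row.names that labels each row with its original row name. Several thousand rows were removed, and the new column labels the remaining rows with their old row numbers. But when I look at the head of the new object, I don't see the row labels. Why does the row.names label show up in the Environment pane but not when I view the head?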
#Remove all rows with only NA values
> Store2df <- Store2[!!rowSums(!is.na(Store2)),]
#View head of store2df
> head(Store2df)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1 <NA> Male <NA> <NA> <NA>
2 45-54 Female <NA> <NA> <NA>
5 45-54 Female 75k-100k Married Yes
6 25-34 Male 75k-100k Married No
7 35-44 Female 125k-150k Married Yes
8 55-64 Male 75k-100k Married No
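A possible explanation, offered as a sketch rather than a definitive answer: subsetting a data.frame keeps the original row names, and those are the unlabeled leftmost numbers in the head() output above (1, 2, 5, 6, ...). They are an attribute of the data.frame, not a data column. The small data frame below is made up for illustration.

```r
# Small illustrative data frame (values are made up for this sketch)
df <- data.frame(Age    = c(NA, "45-54", NA, NA, "45-54"),
                 Gender = c("Male", "Female", NA, NA, "Female"),
                 stringsAsFactors = FALSE)

# Drop rows in which every column is NA
kept <- df[rowSums(!is.na(df)) > 0, ]

# Row names still refer to positions in the original data frame
rownames(kept)
# [1] "1" "2" "5"

# Renumber from 1 if the original positions are not needed
rownames(kept) <- NULL
rownames(kept)
# [1] "1" "2" "3"
```

Keeping the old row names can be handy for tracing rows back to the source data; resetting them only makes sense once that link is no longer needed.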
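EDIT 2: I put in the row number / userID column to keep track of the users. To remove the all-NA rows I had taken out that first column, and now I need to track which users I removed. There are more than 2000 rows with all NA values, and I don't want to build the index for each row by hand.
Question: How do I remove the emails that correspond to the missing data?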
> #First six rows of the column RowNo
> head(Store2df$RowNo)
[1] 1 2 3 4 5 6
I want to remove 2099 rows from RowNo, which is contained in the data.frame Store2df. This script identifies which rows are entirely empty in the data.frame Store2df without RowNo:
> which(rowSums(is.na(Store2df))==ncol(Store2df))
The first 6 rows, with rows 3 and 4 removed:
> head(Store2df$RowNo)
[1] 1 2 5 6 7 8
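I want to complete 4 steps:
1) Take the RowNo column out of the data.frame Store2df and save it as a separate vector
2) Remove the rows with all NA values from the data.frame Store2df
3) Remove the same rows from the vector Store2new1 as were removed from the data.frame Store2df
4) Combine the vector and the data.frame, with the vector matching the data.frame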
Answer 0 (score: 12)
which(rowSums(is.na(Store2))==ncol(Store2))
#3 4
#3 4
Or
which(Reduce(`&`,as.data.frame(is.na(Store2))))
#[1] 3 4
Or
which(!rowSums(!is.na(Store2)))
#3 4
#3 4
Store2 <- structure(list(Age = c(NA, "45-54", NA, NA, "45-54"), Gender = c("Male",
"Female", NA, NA, "Female"), HouseholdIncome = c(NA, NA, NA,
NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"),
PresenceofChildren = c(NA, NA, NA, NA, "Yes"), HomeOwnerStatus = c(NA,
NA, NA, NA, "Own"), HomeMarketValue = c(NA, NA, NA, NA, "150k-200k"
)), .Names = c("Age", "Gender", "HouseholdIncome", "MaritalStatus",
"PresenceofChildren", "HomeOwnerStatus", "HomeMarketValue"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
To remove the rows that are all NAs:
Store2[!!rowSums(!is.na(Store2)),]
# Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
#1 <NA> Male <NA> <NA> <NA> <NA>
#2 45-54 Female <NA> <NA> <NA> <NA>
#5 45-54 Female 75k-100k Married Yes Own
#HomeMarketValue
#1 <NA>
#2 <NA>
#5 150k-200k
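is.na(Store2) returns a logical matrix that is TRUE wherever an element is missing (NA).
! negates a logical index, i.e. TRUE becomes FALSE and vice versa.
rowSums, applied to !is.na(Store2), gives the number of non-NA elements in each row: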
rowSums(!is.na(Store2))
# 1 2 3 4 5
# 1 2 0 0 7 # 3rd and 4th row have `0 non NA` values
Applying ! to these counts turns 0 into TRUE and any nonzero count into FALSE:
!rowSums(!is.na(Store2))
# 1 2 3 4 5
#FALSE FALSE TRUE TRUE FALSE
We want to remove the rows with all NAs, i.e. 0 non-NA values, so apply ! again:
!!rowSums(!is.na(Store2))
#1 2 3 4 5
#TRUE TRUE FALSE FALSE TRUE
Subsetting Store2 with this logical index keeps only the rows that have at least one non-NA value.
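The negate, count, negate-twice logic can be wrapped in a small function. This is only a sketch: drop_all_na_rows and its ignore argument are hypothetical names introduced here, and the example data is made up. For non-negative counts, `count > 0` selects the same rows as `!!count` and may read more easily.

```r
# Hypothetical helper: keep rows with at least one non-NA value.
# `ignore` names columns (e.g. an ID column) to skip when checking for NAs.
drop_all_na_rows <- function(df, ignore = character(0)) {
  check <- df[, setdiff(names(df), ignore), drop = FALSE]
  non_na_count <- rowSums(!is.na(check))   # non-NA values per row
  df[non_na_count > 0, , drop = FALSE]     # same rows as !!non_na_count
}

# Made-up example data with an ID column that is never NA
example_df <- data.frame(RowNo  = 1:5,
                         Age    = c(NA, "45-54", NA, NA, "45-54"),
                         Gender = c("Male", "Female", NA, NA, "Female"),
                         stringsAsFactors = FALSE)

result <- drop_all_na_rows(example_df, ignore = "RowNo")
result$RowNo
# [1] 1 2 5
```

Excluding the ID column from the check matters because an ID is present on every row, so a row would otherwise never count as all NA.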
If you have two RowNo vectors, i.e. one stored separately before removing the all-NA rows and a second one stored after removing them:
RowNo1 <- 1:6
RowNo2 <- c(1,2,5,6)
RowNo1 %in% RowNo2
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
RowNo1[RowNo1 %in% RowNo2]
#[1] 1 2 5 6
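Since the question also asks which users were removed, the same comparison can be inverted. A minimal sketch, reusing RowNo1 and RowNo2 from above; removed is a name introduced here for illustration:

```r
RowNo1 <- 1:6            # IDs before removing all-NA rows
RowNo2 <- c(1, 2, 5, 6)  # IDs after removing them

# IDs present before but not after, i.e. the removed rows
removed <- RowNo1[!(RowNo1 %in% RowNo2)]
removed
# [1] 3 4

# setdiff() gives the same IDs in one call
setdiff(RowNo1, RowNo2)
# [1] 3 4
```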
Based on your new request, let me try again:
Store2 <- structure(list(RowNo = 1:5, Age = c(NA, "45-54", NA, NA, "45-54"
), Gender = c("Male", "Female", NA, NA, "Female"), HouseholdIncome = c(NA,
NA, NA, NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"
), PresenceofChildren = c(NA, NA, NA, NA, "Yes")), .Names = c("RowNo",
"Age", "Gender", "HouseholdIncome", "MaritalStatus", "PresenceofChildren"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
))
Save RowNo as a separate vector (I'm not sure why this is needed):
Store2new1 <- Store2$RowNo
Remove the rows with all NA values from the Store2 data.frame and store the result as Store2df:
Store2df <- Store2[!!rowSums(!is.na(Store2[,-1])),] #Here you already get the new dataset with `RowNo` column
Store2df
#RowNo Age Gender HouseholdIncome MaritalStatus PresenceofChildren
#1 1 <NA> Male <NA> <NA> <NA>
#2 2 45-54 Female <NA> <NA> <NA>
#5 5 45-54 Female 75k-100k Married Yes
Remove the same rows from the Store2new1 vector as were removed from the Store2df data.frame:
Store2new2 <- Store2new1[Store2new1 %in% Store2df$RowNo]
Store2new1[Store2new1 %in% Store2df$RowNo]
#[1] 1 2 5
I really don't think the third or fourth step is needed unless you want to remove further rows, which isn't clear from the post.
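If the goal is to know both what was kept and which IDs were dropped, one option is to compute the all-NA flag once and use it twice. This is a sketch under the assumption that RowNo is the first column; all_na and removed_ids are names introduced here, and the data is a cut-down version of the sample above.

```r
# Cut-down version of the sample data, with RowNo as the first column
Store2 <- data.frame(
  RowNo           = 1:5,
  Age             = c(NA, "45-54", NA, NA, "45-54"),
  Gender          = c("Male", "Female", NA, NA, "Female"),
  HouseholdIncome = c(NA, NA, NA, NA, "75k-100k"),
  stringsAsFactors = FALSE
)

# TRUE for rows where every column except RowNo is NA
all_na <- rowSums(!is.na(Store2[, -1, drop = FALSE])) == 0

removed_ids <- Store2$RowNo[all_na]  # IDs of the dropped rows
Store2df    <- Store2[!all_na, ]     # data with those rows removed

removed_ids
# [1] 3 4
Store2df$RowNo
# [1] 1 2 5
```

Because the ID column stays in the data.frame throughout, there is no separate vector to keep in sync.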
Answer 1 (score: 4)
Using the Store2 sample data posted in the answer provided by @akrun:
which(apply(Store2, 1, function(x) all(is.na(x))))
#3 4
#3 4
Or, similar to akrun's answer:
which(rowSums(!is.na(Store2))==0)
#3 4
#3 4
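As a quick sanity check, the expressions from both answers can be compared on a small data frame. This is only a sketch with made-up data; unname() is used because the approaches may differ in whether they attach row names to the result. On large data the rowSums versions are generally expected to be faster than apply, which converts the data.frame to a matrix and calls an R function for each row, though this was not measured here.

```r
# Made-up data: rows 3 and 4 are entirely NA
Store2 <- data.frame(Age    = c(NA, "45-54", NA, NA, "45-54"),
                     Gender = c("Male", "Female", NA, NA, "Female"),
                     stringsAsFactors = FALSE)

by_rowsums_eq   <- which(rowSums(is.na(Store2)) == ncol(Store2))
by_reduce       <- which(Reduce(`&`, as.data.frame(is.na(Store2))))
by_apply        <- which(apply(Store2, 1, function(x) all(is.na(x))))
by_rowsums_zero <- which(rowSums(!is.na(Store2)) == 0)

# All four approaches should pick out the same row positions
unname(by_rowsums_eq)
# [1] 3 4
all(unname(by_rowsums_eq) == unname(by_reduce),
    unname(by_reduce)     == unname(by_apply),
    unname(by_apply)      == unname(by_rowsums_zero))
# [1] TRUE
```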