My data looks like this:
df <- data.frame(pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs", "Diamonds", "Diamonds", "Hearts", "Hearts"),
type = c("Ace", "Two", "Three", "Ace", "Two", "Three", "Ace", "Two", "King", "Queen"),
V1 = c(4, 3, NA, 7, NA, NA, 5, 12, NA, NA),
V2 = c(16, 23, NA, 15, NA, NA, 8, 19, NA, NA))
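I need to impute NA as 0, but only in very specific cases. For each pop (population) and type, the data (V1, V2, etc.) must contain either all NAs or all numbers. In this example, the Spades pop is missing data in V1 and V2 for the Spades-Three row, while Spades-Ace and Spades-Two have data. So V1 and V2 for Spades-Three need to change from NA to 0. The same applies to the Clubs pop. The Hearts pop has no data at all, so its NAs stay as they are.
The resulting data set should look like this: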
df2 <- data.frame(pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs", "Diamonds", "Diamonds", "Hearts", "Hearts"),
type = c("Ace", "Two", "Three", "Ace", "Two", "Three", "Ace", "Two", "King", "Queen"),
V1 = c(4, 3, 0, 7, 0, 0, 5, 12, NA, NA),
V2 = c(16, 23, 0, 15, 0, 0, 8, 19, NA, NA))
I can perform this imputation with this code:
library(dplyr)  # provides filter()
ID <- unique(df$pop)
for (i in seq_along(ID)) {
  dftemp <- filter(df, pop == paste(ID[i]))
  # Number of rows (types) for this pop
  num_type <- length(dftemp$type)
  # Number of NA's in this pop for V1
  num_na <- sum(is.na(dftemp$V1))
  print(num_type)
  print(num_na)
  # Impute only when the pop has some, but not all, values missing
  if (num_na < num_type && num_na > 0) {
    # print(paste(ID[i]))
    df$V1[with(df, pop == paste(ID[i]) & is.na(V1))] <- 0
    df$V2[with(df, pop == paste(ID[i]) & is.na(V2))] <- 0
  }
}
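My problem is scaling this up. I need to do this for many more columns, so I want to put the column names in a list that I can then loop over. But for some reason, inside the if block above, changing
df$V1[with(df, pop == paste(ID[i]) & is.na(V1))] <- 0
to
df[newlist[k]][with(df, pop == paste(ID[i]) & is.na(newlist[k]))] <- 0
(where newlist <- c("V1", "V2", "V3", "V4"), and so on) makes the pop == paste(ID[i]) condition stop working. If I hard-code pop == "Spades" it works, but that is obviously even less efficient than the old approach.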
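The end goal is a function that I can pass the data frame and a list of columns to, but I am stuck on this problem.
My current attempt at writing the function looks like this: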
imputezero <- function(df, columnlist) {
  for (i in 1:length(ID)) {
    for (x in 1:length(columnlist)) {
      dftemp <- filter(df, pop == paste(ID[i]))
      num_type <- length(dftemp$type)
      num_na <- sum(is.na(dftemp[collist[x]]) == TRUE)
      if (num_na < num_type && num_na > 0) {
        df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0
        return(df)
      }
    }
  }
}
list_status <- c("V1", "V2")
test_df <- imputezero(df, list_status)
So how can I get df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0 to work?
If my general approach is completely wrong, or there is a way to cut out all the noise, I also welcome any feedback.
Answer 0 (score: 0)
You can use mutate_at from dplyr to achieve this; it scales to any number of columns.
If I understand correctly, you can do this:
df %>%
group_by(pop) %>%
mutate_at(.funs = funs(ifelse(is.na(.) & sum(is.na(.)) != n(), 0, .)),
.vars = vars(-type))
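For newer dplyr versions, the same idea can be sketched with across(), since funs() is deprecated as of dplyr 0.8.0 and mutate_at() is superseded as of dplyr 1.0.0. This is a minimal sketch, assuming dplyr >= 1.0.0; the variable names columns and result are made up for illustration, and the column names are passed as a character vector to match the goal stated in the question.

```r
library(dplyr)

df <- data.frame(
  pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs",
          "Diamonds", "Diamonds", "Hearts", "Hearts"),
  type = c("Ace", "Two", "Three", "Ace", "Two", "Three",
           "Ace", "Two", "King", "Queen"),
  V1 = c(4, 3, NA, 7, NA, NA, 5, 12, NA, NA),
  V2 = c(16, 23, NA, 15, NA, NA, 8, 19, NA, NA)
)

# Columns to impute, given by name (illustrative variable, not from the post)
columns <- c("V1", "V2")

result <- df %>%
  group_by(pop) %>%
  mutate(across(
    all_of(columns),
    # Replace NA with 0 only when the group is not entirely NA
    ~ ifelse(is.na(.x) & sum(is.na(.x)) != n(), 0, .x)
  )) %>%
  ungroup()

print(result)
```

Using all_of(columns) instead of -type means only the listed columns are touched, so extra non-numeric columns in a larger data frame would be left alone.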
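Answer 1 (score: 0)
I changed the if in the function so that it skips the iteration with next when num_na equals num_type or when num_na is zero. Otherwise the line df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0 is executed. I also moved return(df) to the end of the function. This seems to work.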
library(dplyr)  # provides filter()

imputezero <- function(df, columnlist) {
  # Derive the groups from the data instead of relying on a global ID
  ID <- unique(df$pop)
  for (i in seq_along(ID)) {
    for (x in seq_along(columnlist)) {
      dftemp <- filter(df, pop == paste(ID[i]))
      num_type <- length(dftemp$type)
      num_na <- sum(is.na(dftemp[columnlist[x]]))
      # Skip groups that are entirely NA or have no NA at all
      if (num_na == num_type || num_na == 0) {
        next
      }
      df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0
    }
  }
  return(df)
}
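As an alternative to the nested loops, the same rule can be sketched in base R with ave(), which computes a per-group value and recycles it across the group's rows. This is only a sketch: the helper name impute_zero_by_group and its group argument are made up for illustration and do not appear in the posts above. It indexes each column with df[[col]]. By contrast, is.na(newlist[k]) in the question tests the column name string, which is never NA, and that is likely why that version never matched any rows.

```r
# Hypothetical helper (name and arguments are illustrative assumptions)
impute_zero_by_group <- function(df, columns, group = "pop") {
  for (col in columns) {
    is_missing <- is.na(df[[col]])
    # For every row: does its group contain at least one non-missing value?
    group_has_data <- ave(!is_missing, df[[group]], FUN = any)
    # Fill with 0 only where the value is missing but the group has data
    df[[col]][is_missing & group_has_data] <- 0
  }
  df
}

df <- data.frame(
  pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs",
          "Diamonds", "Diamonds", "Hearts", "Hearts"),
  type = c("Ace", "Two", "Three", "Ace", "Two", "Three",
           "Ace", "Two", "King", "Queen"),
  V1 = c(4, 3, NA, 7, NA, NA, 5, 12, NA, NA),
  V2 = c(16, 23, NA, 15, NA, NA, 8, 19, NA, NA)
)

test_df <- impute_zero_by_group(df, c("V1", "V2"))
print(test_df)
```

Groups that are entirely NA (Hearts here) are left untouched because group_has_data is FALSE for all of their rows.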