I want to subset a dataframe by applying two conditions to it. When I attach the dataframe, apply the first condition, detach the dataframe, attach it again, apply the second condition, and detach again, I get the expected result, a dataframe with 9 observations.
Of course, you wouldn't normally detach/attach before applying the second condition. So I attach, apply the two conditions after one another, and then detach. But the result is different now: It's a dataframe with 24 observations. All but 5 of these observations consist exclusively of NA-values.
I know there's lots of advice against using attach, and I understand the point that it's dangerous, because it's easy to loose track of an attach statement still being active. My point here is a different one; I see a behaviour in attach that I just can't understand. I'm using R Studio 0.99.465 with 64-bit-R 3.2.1.
So here's the code, first the version that is clumsy, but produces the correct result (df with 9 observations, all non-NA):
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
detach(df)
attach(df)
df <- df[early_vvl <= late_no_reaction,]
detach(df)
Now the one that produces the dataframe with 24 observations, most of which consist only of NA values:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
df <- df[early_vvl <= late_no_reaction,]
detach(df)
I'm puzzled. Does anybody understand why the second version produces a different result?
答案 0 :(得分:3)
Have a look at what happens here:
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
length(early_vvl <= late_no_reaction)
## [1] 32
df <- df[early_vvl <= late_no_reaction,]
detach(df)
So your logical vector early_vvl <= late_no_reaction
still uses the original df
, the one that you attached. When you subset the data.frame
the second time, the logical is longer than the data.frame
has rows and it behaves like this:
df <- data.frame(x=1:5, y = letters[1:5])
df[rep(c(TRUE, FALSE), 5), ]
## x y
## 1 1 a
## 3 3 c
## 5 5 e
## NA NA <NA>
## NA.1 NA <NA>
You could just use &
to avoid the problem:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl & early_vvl <= late_no_reaction,]
detach(df)