Why are empty levels in factors tabulated after assigning NA to missing values?

Asked: 2018-10-25 12:24:13

Tags: r dataframe

I have a data frame df in which the column foo contains data of type factor:

df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))

When I check the structure with str(df$foo), I get:

Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 2 ...

Why does it report 3 levels when my data contains only 2?


Edit:

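I seem to have cleaned up the missing values "" by assigning NA. When I call table(df$foo), it still appears to count the "missing value" level, but finds no occurrences of it:

  F M
0 2 2

However, when I call df$foo, I find that it reports only two levels:

Levels: F M

How is it possible that table still counts the empty level, and how can I fix this behaviour?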

1 answer:

Answer 0 (score: 2)

Check whether your data frame really has no missing values, because the output suggests that it does. Try this:

# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)

# works if your missing value is just ""
which(df$MF == "") 
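The integer-based check above relies on "" being the first level, which only holds when the empty string actually occurs among the levels (levels are sorted by default, and "" sorts first). A minimal sketch of a check that compares labels instead; the vector foo and the variable names are made up for illustration:

```r
# Illustrative factor with one empty-string entry (not the asker's real data)
foo <- factor(c("M", "F", "F", "M", ""))

# Is the empty string one of the declared levels?
has_empty_level <- "" %in% levels(foo)

# Positions whose label is "", found by label rather than by level index
empty_rows <- which(foo == "")

print(levels(foo))
print(has_empty_level)
print(empty_rows)
```

If the factor had no "" level at all, level 1 would be "F", and testing as.integer(foo) == 1 would flag every "F" row, so comparing labels is the safer of the two checks.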

Then you should clean up your data frame so that missing values are represented properly. factor will handle NA:

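# stringsAsFactors = TRUE is required on R >= 4.0.0, where data.frame() no longer converts strings to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
df$sex[which(as.integer(df$sex) == 1)] <- NA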
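As an alternative sketch (not part of the original answer), the replacement can be done on the level itself rather than on individual rows. Setting a level to NA through levels<- turns every value carrying that level into NA and removes the level in the same step. The vector sex here is illustrative:

```r
# Illustrative factor with one empty-string entry
sex <- factor(c("M", "F", "F", "M", ""))

# Rename the "" level to NA: affected values become NA and the level disappears.
# If no "" level exists, the index is empty and nothing changes.
levels(sex)[levels(sex) == ""] <- NA

print(levels(sex))
print(is.na(sex))
```

Because the level is removed along with the values, this variant leaves no unused level behind.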

After cleaning the data, you will have to drop the unused levels so that tabulations such as table do not count occurrences of the empty level.
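The reason behind this is that table builds one cell per declared level of a factor, whether or not that level occurs in the data. A small sketch with made-up values "a" and "b" shows the mechanism, along with two ways of rebuilding the level set from the values actually present:

```r
# A factor may declare levels that never occur in the data
x <- factor(c("a", "a"), levels = c("a", "b"))

# table() reports every declared level, so "b" appears with a count of 0
counts <- table(x)

# droplevels() and a second call to factor() both discard unused levels
counts_dropped <- table(droplevels(x))
counts_refactored <- table(factor(x))

print(counts)
print(counts_dropped)
print(counts_refactored)
```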

Observe the following sequence of steps and their output:

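# Build a dataframe to reproduce your behaviour
# (stringsAsFactors = TRUE is needed on R >= 4.0.0, where strings are
# no longer converted to factors by default)
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)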
# notice the empty level "" for the missing value
> levels(df$MF)
[1] ""  "F" "M"

# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)

  F M 
1 2 2

# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)

# AHA! So despite us changing the value, the levels of the original
# factor were not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] ""  "F" "M"

# Indeed, the empty level is still present in the factor, but there are
# no occurrences!
> table(df$MF)

  F M 
0 2 2 

# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)

# factors fixed
> levels(df$MF)
[1] "F" "M"

# tabulation fixed
> table(df$MF)

F M 
2 2
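If the data comes from a text file (the question does not say, so this is an assumption), the empty level can be avoided at import time by declaring "" as a missing-value marker through the na.strings argument of read.csv. The inline CSV text below is a made-up stand-in for such a file:

```r
# Hypothetical CSV content; row 5 has an empty foo field
csv_text <- "bar,foo\n1,M\n2,F\n3,F\n4,M\n5,\n"

# Treat both "" and "NA" as missing, and convert strings to factors
df <- read.csv(text = csv_text, na.strings = c("", "NA"), stringsAsFactors = TRUE)

print(levels(df$foo))
print(sum(is.na(df$foo)))
print(table(df$foo))
```

Read this way, the empty field never becomes a level, so no later cleanup or droplevels call should be needed for that column.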