Question

I have a data frame df in which the column foo contains data of type factor:

df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))

When I inspect the structure with str(df$foo), I get:

Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 2 ..

Why does it report 3 levels when there are only 2 in my data?

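Edit:

There seems to be a missing value "" that I cleaned up by assigning NA to it. When I call table(df$foo), it still seems to count the "missing value" level, but finds no occurrences:

  F M 
0 2 2 

However, when I call df$foo, I find that it reports only two levels:

Levels: F M

How is it possible that table() still counts the empty level, and how can I fix this behavior?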

Answer 1

Check whether your data frame really has no missing values, because it looks like it does have some. Try this:

# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)

# works if your missing value is just ""
which(df$MF == "")
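To see why comparing against 1 finds the empty entries, here is a small sketch on a made-up factor (the name mf is only illustrative). Levels are sorted when the factor is created, so "" comes first and receives the integer code 1.

```r
# Illustrative factor: the fifth entry is an empty string, not NA
mf <- factor(c("M", "F", "F", "M", ""))

levels(mf)        # the empty string sorts before "F" and "M"
as.integer(mf)    # internal integer codes, one per element

# Both checks point at the same position
which(as.integer(mf) == 1)
which(mf == "")
```

The second check is the safer one in general, because it does not depend on which position the empty level happens to occupy.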

Then you should clean up your data frame so that missing values are coded properly. A factor handles NA correctly:

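# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)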
df$sex[which(as.integer(df$sex) == 1)] <- NA
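As an alternative sketch of the same cleanup, base R also lets you set a level itself to NA. This turns every "" entry into NA and removes the level in one step. The variable name sex is only illustrative.

```r
# Illustrative data: the fifth entry is an empty string, not NA
sex <- factor(c("M", "F", "F", "M", ""))

# Setting a level to NA converts its occurrences to NA and removes the level
levels(sex)[levels(sex) == ""] <- NA

levels(sex)       # only the levels that carry data remain
sum(is.na(sex))   # number of entries that are now missing
```

With this variant the empty level never lingers as an unused level, because it is removed at the same time as the values are recoded.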

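After cleaning the data, you will have to drop the unused levels so that tabulations such as table() stop counting occurrences of the empty level.

Observe the following sequence of steps and their output: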
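A minimal sketch of that step, assuming a column that has already been recoded to NA but still carries the unused "" level. The data frame is made up for illustration. droplevels() accepts a single factor or a whole data frame, and calling factor() again on the column has the same effect.

```r
# Hypothetical data frame whose factor still carries the unused "" level
df <- data.frame(
  rest = 1:5,
  sex  = factor(c("M", "F", "F", "M", NA), levels = c("", "F", "M"))
)

table(df$sex)                  # still tabulates the empty level, with a count of 0

clean <- droplevels(df)        # drops unused levels in every factor column
levels(clean$sex)

refactored <- factor(df$sex)   # re-creating the factor also drops unused levels
levels(refactored)
```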

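# Build a data frame to reproduce your behaviour
# (stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings
# are no longer converted to factors by default)
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)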
# notice the empty level "" for the missing value
> levels(df$MF)
[1] ""  "F" "M"

# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)

  F M 
1 2 2

# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)

# AHA! So despite us changing the value, the levels of the factor
# were not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] ""  "F" "M"

# Indeed, the empty level is still present in the factor, but there are
# no occurrences!
> table(df$MF)

  F M 
0 2 2 

# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)

# factors fixed
> levels(df$MF)
[1] "F" "M"

# tabulation fixed
> table(df$MF)

F M 
2 2
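If the data is read from a text file, the empty level can also be avoided at import time. The following is only a sketch under that assumption: the inline CSV is invented for illustration, and na.strings = "" asks read.csv() to treat empty fields as NA.

```r
# Invented CSV content; the last row has an empty MF field
csv_text <- "Restaurant,MF
1,M
2,F
3,F
4,M
5,
"

# Empty fields become NA on import, so no "" level is ever created
df <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

levels(df$MF)
table(df$MF, useNA = "ifany")   # also shows how many values are missing
```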

Why are empty levels in factors being tabulated after assigning NA to missing values?
