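I have a data frame df, in which the column foo contains data of type factor:
df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))
When I inspect the structure with str(df$foo), I get:
Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 ...
Why does it report 3 levels when my data contains only 2?
Edit:
I seem to have cleared the missing value "" by assigning NA.
When I call table(df$foo), it still seems to count the "missing value" level, but finds no occurrences of it:
  F M
0 2 2
However, when I call df$foo, I see that it reports only two levels:
Levels: F M
How can the empty level still be counted, and how do I fix this behavior?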
Answer 0 (score: 2)
Check whether your data frame really has no missing values, because it looks like it does. Try this:
# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)
# works if your missing value is just ""
which(df$MF == "")
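Printing a factor shows its levels without quotes, so an empty-string level is easy to overlook in the Levels line. A minimal sketch, assuming a toy factor x (hypothetical, not your data), of ways to make such a level visible:

```r
# Toy factor containing one empty-string entry (hypothetical data)
x <- factor(c("M", "F", "F", "M", ""))

nlevels(x)     # number of levels, including the empty one
levels(x)      # character vector of levels; "" is printed with quotes here
sum(x == "")   # how many entries are the empty string
```

Because levels() returns a character vector, the empty level is printed as "" instead of as invisible whitespace.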
Then you should clean up your data frame so that missing values are properly represented. factor handles NA:
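# stringsAsFactors = TRUE is needed in R >= 4.0.0, where data.frame() no longer converts strings to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
df$sex[which(as.integer(df$sex) == 1)] <- NA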
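As a rough illustration of what recoding to NA buys you, the sketch below (using a standalone toy factor named sex, an assumption made for the example) shows that the entry is then recognised as missing by is.na and can be counted separately by table:

```r
# Toy factor with one empty-string entry (hypothetical data)
sex <- factor(c("M", "F", "F", "M", ""))

# Recode the empty string as a proper missing value
sex[sex == ""] <- NA

is.na(sex)                    # TRUE only for the recoded entry
table(sex, useNA = "ifany")   # adds a separate <NA> column to the counts
```

Note that the empty level itself is still attached to the factor at this point; only its occurrences are gone.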
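After cleaning your data, you will have to drop the unused levels, so that tabulations such as table do not count occurrences of the empty level.
Observe the following sequence of steps and their output: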
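# Build a data frame to reproduce your behaviour
# (stringsAsFactors = TRUE is required in R >= 4.0.0 to get a factor column)
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)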
# notice the empty level "" for the missing value
> levels(df$MF)
[1] "" "F" "M"
# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)
  F M
1 2 2
# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)
# AHA! Despite us changing the value, the levels of the factor
# were not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] "" "F" "M"
# Indeed, the empty level is present in the factor, but there are
# no occurrences!
> table(df$MF)
  F M
0 2 2
# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)
# factors fixed
> levels(df$MF)
[1] "F" "M"
# tabulation fixed
> table(df$MF)
F M
2 2
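The steps above can be folded into one small function. This is only a sketch: blank_to_na is a hypothetical helper name, not part of base R, and it assumes that the empty string is the only marker for missing values in the column.

```r
# Hypothetical helper: recode "" as NA, then drop levels that no longer occur
blank_to_na <- function(f) {
  stopifnot(is.factor(f))
  f[!is.na(f) & f == ""] <- NA   # skip entries that are already NA
  droplevels(f)
}

# Same toy data as in the walkthrough above
df <- data.frame(Restaurant = 1:5,
                 MF = c("M", "F", "F", "M", ""),
                 stringsAsFactors = TRUE)

df$MF <- blank_to_na(df$MF)
levels(df$MF)   # only the levels that are still in use
table(df$MF)    # no column for the empty level any more
```

Calling factor(df$MF) on the cleaned column would drop the unused level as well; droplevels is used here because it also accepts a whole data frame.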