Question

I understand this is a very basic question, but I don't understand what levels are in R.

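For reference, I wrote a simple script that reads a CSV table, filters on one of its fields, assigns the result to a new variable, and frees the memory allocated to the first variable. If I call unique() on the field I filtered on, I can see that the results were indeed filtered, but there is an extra line showing "Levels" that correspond to the data in the original dataset.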

Example:

df = read.csv(path, sep=",", header=TRUE)
df_intrate = df[df$AssetClass == "ASSET CLASS A", ]

rm(df)
gc()

unique(df_intrate$AssetClass)

Result:

[1] ASSET CLASS A
Levels: ASSET CLASS E ASSET CLASS D ASSET CLASS C ASSET CLASS B ASSET CLASS A

Even though RStudio shows that df_intrate has the expected number of rows for ASSET CLASS A, is the structure information from df somehow retained in df_intrate?

Answer 1

Even though RStudio shows that df_intrate has the expected number of rows for ASSET CLASS A, is the structure information from df somehow retained in df_intrate?

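Yes. This is how categorical variables (called factors) are stored in R: both the levels, a vector of all possible values, and the actual values taken are stored: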

x = factor(c('a', 'b', 'c', 'a', 'b', 'b'))
x
# [1] a b c a b b
# Levels: a b c

y = x[1]
y
# [1] a
# Levels: a b c

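You can get rid of unused levels with droplevels(), or by re-applying the factor function, which creates a new factor from only the values that are present: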

droplevels(y)
# [1] a
# Levels: a

factor(y)
# [1] a
# Levels: a

You can also use droplevels on a data frame to drop all unused levels from all of its factor columns:

dat = data.frame(x = x)
str(dat)
# 'data.frame': 6 obs. of  1 variable:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 2

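# drop = FALSE keeps the one-column result a data frame instead of a bare factor
str(dat[1, , drop = FALSE])
# 'data.frame': 1 obs. of  1 variable:
#  $ x: Factor w/ 3 levels "a","b","c": 1

str(droplevels(dat[1, , drop = FALSE]))
# 'data.frame': 1 obs. of  1 variable:
#  $ x: Factor w/ 1 level "a": 1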
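
Applied to the situation in the question, this might look roughly like the sketch below. The data frame is made up for illustration (the Value column and the exact rows are assumptions, not the asker's CSV):

```r
# Hypothetical stand-in for the data read from the CSV in the question
df = data.frame(
  AssetClass = factor(c("ASSET CLASS A", "ASSET CLASS B",
                        "ASSET CLASS A", "ASSET CLASS C")),
  Value = c(1.5, 2.0, 3.25, 4.0)
)

# Subsetting rows keeps every level of the original factor
df_intrate = df[df$AssetClass == "ASSET CLASS A", ]
nlevels(df_intrate$AssetClass)
# [1] 3

# droplevels() on the data frame removes the levels that are no longer used
df_intrate = droplevels(df_intrate)
levels(df_intrate$AssetClass)
# [1] "ASSET CLASS A"
```

After this, unique(df_intrate$AssetClass) should list only the one remaining level.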

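While not relevant to your present issue, we should also mention that factor has an optional levels argument, which can be used to specify the levels of a factor and the order they should go in. This is useful if you want a specific order (perhaps for plotting or modeling), or if there are more possible levels than are actually present and you want to include them. If you don't specify levels, the default is alphabetical order.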

x = c("agree", "disagree", "agree", "neutral", "strongly agree")
factor(x)
# [1] agree          disagree       agree          neutral        strongly agree
# Levels: agree disagree neutral strongly agree
## not a good order

factor(x, levels = c("disagree", "neutral", "agree", "strongly agree"))
# [1] agree          disagree       agree          neutral        strongly agree
# Levels: disagree neutral agree strongly agree
## better order

factor(x, levels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))
# [1] agree          disagree       agree          neutral        strongly agree
# Levels: strongly disagree disagree neutral agree strongly agree
## good order, more levels than are actually present

See ?reorder and ?relevel (or just use factor again) to change the order of the levels of an already created factor.
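
As a rough sketch of those three options on a small factor (the scores vector is a made-up numeric variable, used only to give reorder something to sort by):

```r
x = factor(c("agree", "disagree", "agree", "neutral"))
levels(x)
# [1] "agree"    "disagree" "neutral"

# relevel() moves one level to the front, making it the reference level
x_ref = relevel(x, ref = "neutral")
levels(x_ref)
# [1] "neutral"  "agree"    "disagree"

# factor() with explicit levels sets the complete order
x_ord = factor(x, levels = c("disagree", "neutral", "agree"))
levels(x_ord)
# [1] "disagree" "neutral"  "agree"

# reorder() sorts levels by a summary (mean by default) of another variable
scores = c(3, 1, 3, 2)   # hypothetical values, one per element of x
x_by = reorder(x, scores)
levels(x_by)
# [1] "disagree" "neutral"  "agree"
```

In each case only the order of the levels changes; the values themselves stay the same.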

Answer 2

You are seeing Levels because the column is stored in an R data structure called a factor. Factors are of integer type:

typeof(as.factor(letters))
#[1] "integer"

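However, they carry labels that map each integer to a character value. Factors are often useful in models where the algorithm needs numbers (sometimes in the form of dummy variables) but you want to keep labels that are more meaningful to humans when interpreting the model.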
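
To illustrate the dummy-variable point, here is a small sketch using model.matrix from the stats package on a toy factor (the factor f is made up for the example):

```r
f = factor(c("a", "b", "c", "a"))

# model.matrix() expands the factor into 0/1 indicator columns;
# the first level ("a") is absorbed into the intercept
mm = model.matrix(~ f)
colnames(mm)
# [1] "(Intercept)" "fb"          "fc"

# The "fb" column is 1 only where f is "b"
unname(mm[, "fb"])
# [1] 0 1 0 0
```

The column names are built from the level labels, which is what keeps the fitted model readable.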

The levels are an attribute of the vector:

attributes(as.factor(letters))
#$levels
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
#[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

#$class
#[1] "factor"

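This means that even after you subset the column down to ASSET CLASS A, the column's attributes are carried over. That has nothing to do with the length of the vector, though: the result of unique() still contains a single value, shown after [1].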
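
A minimal sketch of how the integer codes and the levels attribute fit together, and how the attribute survives subsetting (f and g are toy objects made up for the example):

```r
f = factor(c("b", "a", "c", "a"))

as.integer(f)    # the stored integer codes
# [1] 2 1 3 1
levels(f)        # the labels those codes index into
# [1] "a" "b" "c"

# Subsetting keeps the levels attribute, even for a single element
g = f[2]
length(g)
# [1] 1
levels(g)
# [1] "a" "b" "c"

# Codes and levels together reproduce the original strings
levels(f)[as.integer(f)]
# [1] "b" "a" "c" "a"
```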

Answer 3

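R has a character class and a factor class. character is your basic string data structure. factor is important for statistics: for example, you might have a dataset in which people are divided by the connectedness of their earlobes (an important but often overlooked distinction). In that case each person would have the value connected or free. If you are modeling intelligence as a function of earlobe connectedness, you want the model to understand that there are two categories of people, connected or free, so you'd represent the variable as a factor vector with two levels: connected and free. So that is, semantically, why levels are a thing in R.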

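Syntactically, factor and character variables respond differently to as.integer. factor variables are converted to the number corresponding to their level, while character variables are converted more like a traditional atoi. In general, you will run into a lot of problems if you operate on a factor variable thinking it is a character.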
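
A short sketch of that difference, using made-up numeric-looking strings:

```r
s = c("10", "20", "10", "5")
f = factor(s)

# character: each string is parsed as a number
as.integer(s)
# [1] 10 20 10  5

# factor: you get the level codes; levels sort as strings: "10" "20" "5"
as.integer(f)
# [1] 1 2 1 3

# Converting through character first recovers the numbers
as.integer(as.character(f))
# [1] 10 20 10  5
```

Going through as.character first is the usual way to avoid silently ending up with level codes.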

When I read csv files, in most cases I find I want character values rather than factors, so I usually set read.csv(..., stringsAsFactors=FALSE). (YMMV as to whether this is your general preference.)
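
For comparison, here is a small sketch that reads the same inline CSV both ways. The CSV text is made up, and the text argument is used only so the example needs no file. Note that since R 4.0.0 stringsAsFactors defaults to FALSE, so factors only appear if you ask for them:

```r
# Hypothetical CSV content, passed inline instead of a file path
csv_text = "AssetClass,Value
ASSET CLASS A,1.5
ASSET CLASS B,2.0"

as_strings = read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_strings$AssetClass)
# [1] "character"

as_factors = read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factors$AssetClass)
# [1] "factor"
levels(as_factors$AssetClass)
# [1] "ASSET CLASS A" "ASSET CLASS B"
```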

What are levels in R?