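split is an especially important function in R core. Many Stack Overflow answers that offer base-R solutions to data manipulation problems rely on it; it is the workhorse of any group-by operation.
There are also many questions whose solution is a single line of split. Many people do not know that split.data.frame can split a matrix by row, and that split.default can split a data frame by column.
Perhaps the R documentation on split is not doing very well. It does mention the first use, but not the second.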
There are four methods for split in R core:
methods(split)
#[1] split.data.frame split.Date split.default split.POSIXct
I will provide an answer that explains in depth how split.data.frame, split.default and the C-level .Internal(split(x, f)) work. Other answers on "Date" and "POSIXct" objects are welcome.
Answer 0 (score: 7)
How does split.data.frame work?
function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...),
       function(ind) x[ind, , drop = FALSE])
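It calls split.default to split the row index vector seq_len(nrow(x)), then uses an lapply loop to extract the associated rows into list entries.
Strictly speaking, this is not a "data.frame" method. It splits any 2-dimensional object along its first dimension, which includes splitting a matrix by rows.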
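As a small illustration of the point above, the sketch below calls split.data.frame directly on a matrix. The matrix m and the grouping vector g are made up for this example. The method is called by its full name because split() on a plain matrix would dispatch to split.default instead.

```r
# Made-up data: a 4 x 2 integer matrix and one group label per row
m <- matrix(1:8, nrow = 4, dimnames = list(NULL, c("u", "v")))
g <- c("a", "b", "a", "b")

# Call the method explicitly; it splits the row indices 1:4 by g
# and extracts each group of rows with x[ind, , drop = FALSE]
by_row <- split.data.frame(m, g)

by_row$a  # rows 1 and 3 of m, still a two-column matrix
by_row$b  # rows 2 and 4 of m
```

Each list entry keeps its matrix shape because of drop = FALSE, even when a group contains a single row.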
How does split.default work?
function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...)
{
    if (!missing(...))
        .NotYetUsed(deparse(...), error = FALSE)
    if (is.list(f))
        f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
    else if (!is.factor(f))
        f <- as.factor(f)
    else if (drop)
        f <- factor(f)
    storage.mode(f) <- "integer"
    if (is.null(attr(x, "class")))
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    ind <- .Internal(split(seq_along(x), f))
    for (k in lf) y[[k]] <- x[ind[[k]]]
    y
}
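If x has no class attribute (i.e., it is mostly an atomic vector), .Internal(split(x, f)) is used directly; otherwise, .Internal(split()) is used to split the indices along x, and a for loop then extracts the associated elements into list entries.
An atomic vector (see ?vector) is a vector with one of the following modes: "logical", "integer", "numeric", "complex", "character" and "raw".
An object with a class... er... there are so many! Let me give just three examples: "factor", "data.frame" and "matrix". (A matrix only has an implicit class: attr(x, "class") is NULL for it, so it actually goes down the .Internal(split(x, f)) branch.)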
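To make the two branches of split.default concrete, here is a small check of which objects carry a class attribute, followed by a split of a classed object. All object names below are made up for illustration.

```r
# Made-up example objects
x_plain  <- 0:9
x_matrix <- matrix(1:9, 3)
x_factor <- factor(c("u", "v", "u", "w"))
x_df     <- data.frame(p = 1:2, q = 3:4)

# The test split.default applies before choosing a branch
is.null(attr(x_plain, "class"))   # no class attribute
is.null(attr(x_matrix, "class"))  # also none: a matrix only has an implicit class
is.null(attr(x_factor, "class"))  # has a class attribute
is.null(attr(x_df, "class"))      # has a class attribute

# Classed objects are extracted piece by piece with "[",
# so each piece of a factor is still a factor
pieces <- split(x_factor, c(1, 1, 2, 2))
class(pieces[[1]])

# For a data frame, "[" with a single index selects columns,
# so split.default groups columns rather than rows
cols <- split.default(x_df, c("g1", "g2"))
names(cols$g1)
```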
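I think split.default is not well written. There are so many kinds of classed objects, yet split.default handles them all in the same way, via "[". This works fine for a "factor" and a "data.frame" (so we end up splitting a data frame by columns!), but it definitely does not work on a matrix the way we would expect.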
A <- matrix(1:9, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9
split.default(A, c(1, 1, 2)) ## it does not split the matrix by columns!
#$`1`
#[1] 1 2 4 5 7 8
#
#$`2`
#[1] 3 6 9
In fact, the recycling rule has been applied to c(1, 1, 2), so we are equivalently doing:
split(c(A), rep_len(c(1,1,2), length(A)))
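The equivalence claimed above can be checked directly. This is only a quick sanity check on the same 3 x 3 matrix; the variable names direct and explicit are made up here.

```r
A <- matrix(1:9, 3)
f <- c(1, 1, 2)

# What split.default does to the matrix: f is recycled along all 9 entries
direct <- split.default(A, f)

# The same operation spelled out: flatten A, recycle f by hand
explicit <- split(c(A), rep_len(f, length(A)))

identical(direct, explicit)
```

Because A is stored column by column, recycling c(1, 1, 2) labels the first two entries of every column with 1 and the third with 2, which is why the result looks like a split by rows that has lost its matrix shape.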
Why doesn't R core write another line for a "matrix", like
for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]
So far, the only safe way to split a matrix by columns is to transpose it, apply split.data.frame, and then transpose each piece back.
lapply(split.data.frame(t(A), c(1, 1, 2)), t)
Another workaround, via lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix), is buggy if A is a character matrix.
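For comparison, a column split can also be written in the same style as split.data.frame, by splitting the column indices and extracting with "[". The helper below, split_matrix_cols, is a hypothetical name and not part of base R; it is a minimal sketch that assumes x is a matrix and f has one entry per column.

```r
# Hypothetical helper (not in base R): split a matrix by columns.
# Assumes f has one grouping value per column of x.
split_matrix_cols <- function(x, f, drop = FALSE) {
  stopifnot(is.matrix(x))
  col_groups <- split(seq_len(ncol(x)), f, drop = drop)  # column indices per group
  lapply(col_groups, function(j) x[, j, drop = FALSE])   # keep matrix shape
}

A <- matrix(1:9, 3)
by_col <- split_matrix_cols(A, c(1, 1, 2))
by_col

# Compare with the double-transpose approach
via_transpose <- lapply(split.data.frame(t(A), c(1, 1, 2)), t)
all.equal(by_col, via_transpose)

# No conversion to a data frame is involved, so a character matrix stays character
B <- matrix(letters[1:6], 2)
split_matrix_cols(B, c("x", "x", "y"))
```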
How does .Internal(split(x, f)) work?
This is really the core of the core! I will use a small example below to explain it:
set.seed(0)
f <- sample(factor(letters[1:3]), 10, TRUE)
# [1] c a b b c a c c b b
#Levels: a b c
x <- 0:9
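There are basically three steps. To improve readability, equivalent R code is provided for each step.
Step 1: tabulation (counting the occurrences of each factor level)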
## a factor has integer mode so `tabulate` works
tab <- tabulate(f, nbins = nlevels(f))
#[1] 2 4 4
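A quick check of this step, with the factor values hard-coded (copied from the printed f above) so that the example does not depend on the random number generator. The second factor, f2, is made up to show what happens to a level that never occurs.

```r
# Same values as the f printed above, typed in by hand
f <- factor(c("c", "a", "b", "b", "c", "a", "c", "c", "b", "b"),
            levels = c("a", "b", "c"))

tab <- tabulate(f, nbins = nlevels(f))
tab
as.vector(table(f))  # the same counts, obtained from table()

# A level that never occurs still gets a bin with count 0,
# which later becomes a zero-length list entry in the split result
f2 <- factor(c("a", "c"), levels = c("a", "b", "c"))
tabulate(f2, nbins = nlevels(f2))
lengths(split(seq_along(f2), f2))
```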
Step 2: storage allocation for the resulting list
result <- vector("list", nlevels(f))
## typeof(), not mode(): mode(0:9) is "numeric", which would allocate doubles
for (i in seq_along(tab)) result[[i]] <- vector(typeof(x), tab[i])
names(result) <- levels(f)
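I will annotate this list as follows, where each line is a list element (a vector in this example) and each [ ] is a placeholder for one entry of that vector.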
$a: [ ] [ ]
$b: [ ] [ ] [ ] [ ]
$c: [ ] [ ] [ ] [ ]
Step 3: element allocation
Now it is useful to reveal the internal integer codes of the factor:
.f <- as.integer(f)
#[1] 3 1 2 2 3 1 3 3 2 2
We need to scan x and .f, and fill x[i] into the right entry of result[[.f[i]]], guided by an accumulator buffer vector.
ab <- integer(nlevels(f))  ## accumulator buffer
for (i in seq_along(.f)) {
  fi <- .f[i]
  counter <- ab[fi] + 1L
  result[[fi]][counter] <- x[i]
  ab[fi] <- counter
}
In the following figures, ^ is a pointer to the element being accessed or updated.
## i = 1
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
     ^
ab: [0] [0] [0]  ## on entry
             ^
$a: [ ] [ ]
$b: [ ] [ ] [ ] [ ]
$c: [0] [ ] [ ] [ ]
     ^
ab: [0] [0] [1]  ## on exit
             ^
## i = 2
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
         ^
ab: [0] [0] [1]  ## on entry
     ^
$a: [1] [ ]
     ^
$b: [ ] [ ] [ ] [ ]
$c: [0] [ ] [ ] [ ]
ab: [1] [0] [1]  ## on exit
     ^
## i = 3
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
             ^
ab: [1] [0] [1]  ## on entry
         ^
$a: [1] [ ]
$b: [2] [ ] [ ] [ ]
     ^
$c: [0] [ ] [ ] [ ]
ab: [1] [1] [1]  ## on exit
         ^
## i = 4
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                 ^
ab: [1] [1] [1]  ## on entry
         ^
$a: [1] [ ]
$b: [2] [3] [ ] [ ]
         ^
$c: [0] [ ] [ ] [ ]
ab: [1] [2] [1]  ## on exit
         ^
## i = 5
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                     ^
ab: [1] [2] [1]  ## on entry
             ^
$a: [1] [ ]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [ ] [ ]
         ^
ab: [1] [2] [2]  ## on exit
             ^
## i = 6
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                         ^
ab: [1] [2] [2]  ## on entry
     ^
$a: [1] [5]
         ^
$b: [2] [3] [ ] [ ]
$c: [0] [4] [ ] [ ]
ab: [2] [2] [2]  ## on exit
     ^
## i = 7
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                             ^
ab: [2] [2] [2]  ## on entry
             ^
$a: [1] [5]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [6] [ ]
             ^
ab: [2] [2] [3]  ## on exit
             ^
## i = 8
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                 ^
ab: [2] [2] [3]  ## on entry
             ^
$a: [1] [5]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [6] [7]
                 ^
ab: [2] [2] [4]  ## on exit
             ^
## i = 9
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                     ^
ab: [2] [2] [4]  ## on entry
         ^
$a: [1] [5]
$b: [2] [3] [8] [ ]
             ^
$c: [0] [4] [6] [7]
ab: [2] [3] [4]  ## on exit
         ^
## i = 10
 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                         ^
ab: [2] [3] [4]  ## on entry
         ^
$a: [1] [5]
$b: [2] [3] [8] [9]
                 ^
$c: [0] [4] [6] [7]
ab: [2] [4] [4]  ## on exit
         ^
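Putting the three steps together, the whole procedure can be sketched as one R function and compared against split itself. The name split_emulated is hypothetical. This is only an R-level imitation of the steps described above, not the actual C code; it assumes x is an unnamed atomic vector with one entry in f per element of x.

```r
# Hypothetical R-level emulation of the three steps.
# Assumptions: x is an unnamed atomic vector, length(f) == length(x).
split_emulated <- function(x, f) {
  f <- as.factor(f)
  .f <- as.integer(f)  # internal integer codes of the factor
  nlev <- nlevels(f)

  # Step 1: tabulation
  tab <- tabulate(.f, nbins = nlev)

  # Step 2: storage allocation, one vector of the right type and length per level
  result <- vector("list", nlev)
  for (k in seq_len(nlev)) result[[k]] <- vector(typeof(x), tab[k])
  names(result) <- levels(f)

  # Step 3: element allocation, guided by the accumulator buffer
  ab <- integer(nlev)
  for (i in seq_along(.f)) {
    fi <- .f[i]
    if (is.na(fi)) next  # elements whose group is NA are dropped
    counter <- ab[fi] + 1L
    result[[fi]][counter] <- x[i]
    ab[fi] <- counter
  }
  result
}

# The example traced above, with f typed in by hand
f <- factor(c("c", "a", "b", "b", "c", "a", "c", "c", "b", "b"))
x <- 0:9
res <- split_emulated(x, f)
res
identical(res, split(x, f))
```

The final state of the trace above ($a holding 1 and 5, $b holding 2, 3, 8 and 9, $c holding 0, 4, 6 and 7) is what res should contain.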