Quick question. I am binning a variable in several different ways for exploratory data analysis. Suppose I have a variable named var in a data.frame df.
df <- data.frame(var = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0))
So far I have taken the following approaches (code below):
#Divide into quartiles
df$var_quartile <- with(df, cut(var, breaks=quantile(var, probs=seq(0,1, by=.25)), include.lowest=TRUE))
# Values of var_quartile
> [0,3],[0,3],(7.25,9],(7.25,9],(3,5],(3,5],(5,7.25],[0,3],(5,7.25],(7.25,9],[0,3],(3,5],(3,5],(5,7.25],(5,7.25],(7.25,9],(7.25,9],[0,3],[0,3],(3,5],(5,7.25],[0,3],[0,3],[0,3]
#Bin into 2 equal-width intervals
df$var_bin<- cut(df[['var']],2, include.lowest=TRUE, labels=1:2)
# Values of var_bin
> 1 1 2 2 1 2 2 1 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 1
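The last thing I want to do is split the variable into chunks of 10 observations after sorting it in ascending order. This is the same idea as a median split (counting up to the middle observation), except that I want to count in increments of 10 observations.
Using my example, this would split var into the following parts: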
0,1,1,2,2,2,3,3,3,3
4,4,4,5,5,6,6,6,6,7
7,8,8,8,9,9,9,9
N.B. - I need to run this on very large datasets (typically 3-6 million observations, in wide format).
How can I do this? Thanks!
Answer 0 (score: 7)
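cut_number() (from ggplot2) is designed to cut a numeric vector into intervals containing approximately equal numbers of points. In your case, you could use it like this: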
library(ggplot2)
split(df$var, cut_number(df$var, n=3, labels=1:3))
# $`1`
# [1] 1 2 3 3 2 3 1 2 3 0
#
# $`2`
# [1] 4 5 6 6 4 5 6 4 6
#
# $`3`
# [1] 8 9 9 7 8 9 7 8 9
Answer 1 (score: 4)
vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0) # your vector
nObs <- 10 # number of observations per bin
# create data labels
datLabels <- ceiling(seq_along(vec)/nObs)[rank(vec, ties.method = "first")]
# test data labels:
split(vec, datLabels)
$`1`
[1] 1 2 3 3 2 3 1 2 3 0
$`2`
[1] 4 5 6 6 4 5 6 7 4 6
$`3`
[1] 8 9 9 8 9 7 8 9
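The same labels can also be built with order() instead of rank(), which may be convenient when the bin label has to be stored as a column of a data frame. What follows is only a sketch: the helper name bin_by_count and the column name var_bin10 are made up for illustration, and breaking ties by original position is an assumption about what the question wants.

```r
# Hypothetical helper (not from any package): label each element with the
# index of the fixed-size chunk it falls into once the vector is sorted.
bin_by_count <- function(x, n_obs = 10) {
  ord <- order(x)                    # sorting permutation; ties keep input order
  labels <- numeric(length(x))
  labels[ord] <- ceiling(seq_along(x) / n_obs)  # k-th smallest gets chunk ceiling(k / n_obs)
  labels
}

df <- data.frame(var = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0))
df$var_bin10 <- bin_by_count(df$var, n_obs = 10)

table(df$var_bin10)                                    # chunk sizes: 10, 10, 8
split(sort(df$var), ceiling(seq_along(df$var) / 10))   # the sorted chunks themselves
```

Only the last chunk can be shorter than n_obs. The work is dominated by a single sort, so it should scale reasonably to a few million rows, though that has not been measured here.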
Answer 2 (score: 1)
Do you mean something like this?
x <- sample(100)
binSize <- 10
table(floor(x/binSize)*binSize)
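Note that floor(x/binSize)*binSize groups by value range (bins of equal width), not by number of observations. The sketch below contrasts the two on skewed data; the seed, the rexp() sample and the bin width of 0.5 are arbitrary choices for illustration, not part of the question.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rexp(100)                   # skewed illustrative data
bin_width <- 0.5

# Equal-width bins: counts per bin differ a lot on skewed data
by_width <- table(floor(x / bin_width) * bin_width)

# Equal-count bins: exactly 10 observations in every bin
by_count <- table(ceiling(rank(x, ties.method = "first") / 10))

by_width
by_count
```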
Answer 3 (score: 1)
I created groups of equal size without using cut.
# number_of_groups_wanted = number of rows / divisor in ceiling code
# therefore divisor in ceiling code should be = number of rows / number_of_groups_wanted,
# divisor in ceiling code = (nrow(df)/number_of_groups_wanted)
# min assigns every tied element to the lowest rank
number_of_groups_wanted = 100 # put in the number of groups you want
df$group = ceiling(rank(df$var_to_group, ties.method = "min")/(nrow(df)/number_of_groups_wanted))
df$rank = rank(df$var_to_group, ties.method = "min") # this line is just used to check data
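With ties.method = "min", tied values always land in the same group, so the groups are only approximately equal in size. The sketch below applies the formula above to the vector from the question with 3 groups and compares it with ties.method = "first"; the names grp_min and grp_first are illustrative.

```r
var_to_group <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)
n_groups <- 3
divisor <- length(var_to_group) / n_groups

grp_min   <- ceiling(rank(var_to_group, ties.method = "min")   / divisor)
grp_first <- ceiling(rank(var_to_group, ties.method = "first") / divisor)

table(grp_min)     # group sizes: 10, 9, 9
table(grp_first)   # group sizes: 9, 9, 10

# Does any single value end up in more than one group?
any(rowSums(table(var_to_group, grp_min)   > 0) > 1)   # FALSE
any(rowSums(table(var_to_group, grp_first) > 0) > 1)   # TRUE
```

Which behaviour is preferable depends on whether keeping ties together or keeping group sizes as even as possible matters more.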
Answer 4 (score: 0)
This should do it.
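# Size() is not a base R function; assuming the intent was one bin per 10 sorted observations
n_bins <- ceiling(length(df$var) / 10)
df$var_bin <- cut(rank(df[['var']], ties.method = "first"),
                  breaks = seq(0, n_bins * 10, by = 10), labels = seq_len(n_bins))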