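I have a data.frame with two variables and one factor column. I then compute a subset of this data.frame and want to reorder the remaining factor levels. I found the solution below, but it becomes too slow with real-sized data. So how can I reorder my factor levels?
Here is a step-by-step example: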
library(plyr)
library(ggplot2)
# generate an example data.frame
# x and y are integers, l is a factor
df <- data.frame(x=rep(1:5, each=4), y=1:5, l=factor(letters[1:10]))
df <- df[1:17,]
df
x y l
1 1 1 a
2 1 2 b
3 1 3 c
4 1 4 d
5 2 5 e
6 2 1 f
7 2 2 g
8 2 3 h
9 3 4 i
10 3 5 j
11 3 1 a
12 3 2 b
13 4 3 c
14 4 4 d
15 4 5 e
16 4 1 f
17 5 2 g
Now I compute a temporary data.frame, which I will use to select a subset of df:
# computing temporary data.frame
df2 <- ddply(df, .(l), summarize, sum=sum(y))
df2$pct <- df2$sum / sum(df2$sum) * 100
df2
l sum pct
1 a 2 4.166667
2 b 4 8.333333
3 c 6 12.500000
4 d 8 16.666667
5 e 10 20.833333
6 f 2 4.166667
7 g 4 8.333333
8 h 3 6.250000
9 i 4 8.333333
10 j 5 10.416667
# select only those letters with "high enough" y-value
df2.selected <- df2[df2$pct > 10,]
df2.selected
l sum pct
3 c 6 12.50000
4 d 8 16.66667
5 e 10 20.83333
10 j 5 10.41667
# use only those letters which occur in df2.selected$l
df.subset <- df[df$l %in% df2.selected$l,]
df.subset
x y l
3 1 3 c
4 1 4 d
5 2 5 e
10 3 5 j
13 4 3 c
14 4 4 d
15 4 5 e
I drop the factor levels that are now unused:
# get rid of unused values of l
df.subset$l <- factor(df.subset$l)
str(df.subset)
'data.frame': 7 obs. of 3 variables:
$ x: int 1 1 2 3 4 4 4
$ y: int 3 4 5 5 3 4 5
$ l: Factor w/ 4 levels "c","d","e","j": 1 2 3 4 1 2 3
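As an aside, base R's droplevels() should have the same effect as calling factor() again on the subset. A minimal sketch on a small hypothetical factor `f` (an illustration only, not part of the data above):

```r
# hypothetical factor with four levels (illustration only)
f <- factor(c("a", "b", "c", "d"))

# keep two values; the subset still carries all four levels
f.sub <- f[f %in% c("b", "d")]
nlevels(f.sub)

# droplevels() removes the levels that no longer occur
f.clean <- droplevels(f.sub)
levels(f.clean)
```

Either form leaves the remaining levels in their original (alphabetical) order, which is why a separate reordering step is still needed.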
The new order of the factor levels in my subset should be the following (I need this for the facet_wrap below):
# the new order of the factor variable should be the (inverse) order of sum
df2.selected <- df2.selected[order(-df2.selected$sum),]
df2.selected
l sum pct
5 e 10 20.83333
4 d 8 16.66667
3 c 6 12.50000
10 j 5 10.41667
# that should be the new order of the factor variable l: e, d, c, j
# get rid of unused values of l
df2.selected$l <- factor(df2.selected$l)
df2.selected
l sum pct
5 e 10 20.83333
4 d 8 16.66667
3 c 6 12.50000
10 j 5 10.41667
str(df2.selected)
'data.frame': 4 obs. of 3 variables:
$ l : Factor w/ 4 levels "c","d","e","j": 3 2 1 4
$ sum: int 10 8 6 5
$ pct: num 20.8 16.7 12.5 10.4
# Here I need the order e, d, c, j!
ggplot(data=df.subset, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)
# so I merged both. This is the problem: it's too expensive. Is there a better way?
df.merged <- merge(df.subset, df2.selected, by=c('l'))
df.merged$l <- reorder(df.merged$l, -df.merged$sum)
df.merged
l x y sum pct
1 c 1 3 6 12.50000
2 c 4 3 6 12.50000
3 d 1 4 8 16.66667
4 d 4 4 8 16.66667
5 e 2 5 10 20.83333
6 e 4 5 10 20.83333
7 j 3 5 5 10.41667
str(df.merged)
'data.frame': 7 obs. of 5 variables:
$ l : Factor w/ 4 levels "e","d","c","j": 3 3 2 2 1 1 4
..- attr(*, "scores")= num [1:4(1d)] -6 -8 -10 -5
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr "c" "d" "e" "j"
$ x : int 1 4 1 4 2 4 3
$ y : int 3 3 4 4 5 5 5
$ sum: int 6 6 8 8 10 10 5
$ pct: num 12.5 12.5 16.7 16.7 20.8 ...
ggplot(data=df.merged, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)
Answer 0 (score: 0)
Here is a data.table solution that should be relatively fast:
library(data.table)
dt <- data.table(df, key="l")
keep.lvls <- as.character(
  dt[, list(sum=sum(y)), by=l][,        # get the sums for each group
    pct := sum/sum(sum) * 100][         # pct for each group
    pct > 10][                          # only keep those greater than 10
    order(pct, decreasing=TRUE), l]     # order by pct, pull out `l` only
)
str(dt.final <-
  dt[keep.lvls, ][,                     # only keep `keep.lvls` from `dt`
    l := factor(l, levels=keep.lvls)])  # reset `l` to have `keep.lvls` as its levels, in that order
This produces:
Classes ‘data.table’ and 'data.frame': 8 obs. of 3 variables:
$ l: Factor w/ 4 levels "e","j","d","i": 1 1 2 2 3 3 4 4
$ x: int 2 4 3 5 1 4 3 5
$ y: int 5 5 5 5 4 4 4 4
- attr(*, ".internal.selfref")=<externalptr>
Note that these results differ slightly from yours because we have different random data. This is with set.seed(1).
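The same idea, passing the desired order directly as `levels=` instead of merging, can also be sketched in base R. This is a minimal sketch rather than a benchmarked solution: it rebuilds the deterministic example data from the question, and the names `sums`, `pct`, `sel`, `keep` and `l2` are introduced here for illustration only.

```r
# rebuild the example data from the question
df <- data.frame(x = rep(1:5, each = 4), y = 1:5, l = factor(letters[1:10]))
df <- df[1:17, ]

# per-level sums of y and their share of the total
sums <- tapply(df$y, df$l, sum)
pct  <- sums / sum(sums) * 100

# levels above the 10% threshold, ordered by descending sum
sel  <- sums[pct > 10]
keep <- names(sel)[order(-sel)]

# subset, then drop unused levels and set the order in one step (no merge)
df.subset <- df[df$l %in% keep, ]
df.subset$l <- factor(df.subset$l, levels = keep)
levels(df.subset$l)   # "e" "d" "c" "j"

# alternative: let reorder() compute the per-level sums itself;
# this works here because the subset keeps every row of each selected level
l2 <- reorder(droplevels(df.subset$l), -df.subset$y, FUN = sum)
levels(l2)            # "e" "d" "c" "j"
```

The resulting `df.subset` can be passed to the ggplot call with facet_wrap(~l) as in the question; whether this is faster than the merge on large data would need to be measured.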