How do I subset the rows of a large data.table by a given share of a column's sum?
require(data.table)
x <- data.table(frequency = c(10,9,8,7,6,5,4,3,2,1), names = c("ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "one"))
# Example: subset data.table to at least 90% of the frequency sum.
# Desired answer:
   frequency names
1:        10   ten
2:         9  nine
3:         8 eight
4:         7 seven
5:         6   six
6:         5  five
7:         4  four
8:         3 three
Answer 0 (score: 1)
Is this what you mean?
x[1:which.max(cumsum(frequency) >= 0.9 * sum(frequency))]
   frequency names
1:        10   ten
2:         9  nine
3:         8 eight
4:         7 seven
5:         6   six
6:         5  five
7:         4  four
8:         3 three
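The one-liner above takes leading rows, so it relies on the table already being sorted by frequency in descending order. As a minimal sketch of the same idea for unsorted input, the helper below sorts first and then cuts at the first row whose running total reaches the target share. The name `subset_by_share` and its arguments are hypothetical, not part of data.table, and the sketch assumes the column holds non-negative numbers.

```r
library(data.table)

# Hypothetical helper (not from the original answers): keep the smallest
# leading set of rows, largest values first, whose `col` values reach
# `share` of the column total. Assumes non-negative values in `col`.
subset_by_share <- function(dt, col, share = 0.9) {
  stopifnot(is.data.table(dt), share > 0, share <= 1)
  if (nrow(dt) == 0L) return(dt)
  idx <- order(dt[[col]], decreasing = TRUE)  # row order, largest first
  ord <- dt[idx]                              # sorted copy; dt is untouched
  running <- cumsum(ord[[col]])
  total <- running[length(running)]
  # Index of the first row at which the running total reaches the target
  n_keep <- which(running >= share * total)[1L]
  ord[seq_len(n_keep)]
}

# Same values as the question, deliberately shuffled
x <- data.table(frequency = c(3, 10, 1, 7, 9, 2, 8, 5, 4, 6))
res <- subset_by_share(x, "frequency", 0.9)
print(res)
```

On the question's values this keeps the eight rows from 10 down to 3, matching the desired answer.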
Answer 1 (score: 0)
Depending on the size of the data frame, there are two options:
1) Simple form:
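require(data.table)
x <- data.table(frequency = c(10,9,8,7,6,5,4,3,2,1), names = c("ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "one"))
# Cumulative share of the total, in the table's existing (descending) order
x[, cumfreq := cumsum(frequency) / sum(frequency)]
print(x)
# Keep rows up to and including the first one that reaches 90%:
# a row stays if the cumulative share *before* it is still below 0.9.
# (Filtering on cumfreq <= .9 would stop at 7 rows, about 89% of the sum.)
x <- x[shift(cumfreq, fill = 0) < 0.9]
print(x)
x[, cumfreq := NULL] # drop the helper column by reference when done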
and 2) Elegant form:
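require(data.table)
x <- data.table(frequency = c(10,9,8,7,6,5,4,3,2,1), names = c("ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "one"))
# Note: this cuts at the 10th percentile of the frequency *values*, not at
# 90% of their sum. On this data the cutoff is 1.9, so 9 rows are kept
# rather than the 8 in the desired answer.
top <- quantile(x$frequency, probs = .1)
x <- subset(x, frequency > top)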
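To see how the two rules differ, the sketch below runs both on the question's data and counts the rows each keeps. The variable names (`by_cumsum`, `by_quantile`, `cutoff`) are illustrative only. The cumulative rule looks at the running share of the sum, while the quantile rule only looks at where each value falls among the values, so the two need not agree.

```r
library(data.table)

x <- data.table(frequency = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1))

# Cumulative-share rule: rows up to the first one reaching 90% of the total
n_cum <- which(cumsum(x$frequency) >= 0.9 * sum(x$frequency))[1L]
by_cumsum <- x[seq_len(n_cum)]

# Quantile rule: rows whose value exceeds the 10th percentile of the values
cutoff <- quantile(x$frequency, probs = 0.1)
by_quantile <- x[frequency > cutoff]

cat("cumulative share keeps", nrow(by_cumsum), "rows\n")
cat("quantile cutoff keeps", nrow(by_quantile), "rows\n")
```

If the goal is a share of the sum, as in the question, the cumulative rule is the one that matches it; the quantile rule answers a different question about the distribution of values.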