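How to subset the rows of a large data.table based on a cumulative sum of a column?

Time: 2019-10-14 02:43:36

Tags: r data.table

How can I subset the rows of a large data.table so that the kept rows account for a given share of a column's total?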

require(data.table)
x <- data.table(frequency = c(10,9,8,7,6,5,4,3,2,1), names = c("ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "one"))

# Example: subset data.table to at least 90% of the frequency sum.

# Desired answer:

   frequency names
1:        10   ten
2:         9  nine
3:         8 eight
4:         7 seven
5:         6   six
6:         5  five
7:         4  four
8:         3 three

2 Answers:

Answer 0 (score: 1)

Is this what you mean?

x[seq_len(which.max(cumsum(frequency) >= 0.9 * sum(frequency)))]
   frequency names
1:        10   ten
2:         9  nine
3:         8 eight
4:         7 seven
5:         6   six
6:         5  five
7:         4  four
8:         3 three
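
To see why the one-liner stops at row 8, here is a minimal sketch that spells out the intermediate values. It assumes the table is already sorted from largest to smallest frequency, as in the question. The names `total`, `threshold`, `running`, `crossed`, and `cutoff` are only illustrative, and `>=` is used so that a running total exactly equal to the threshold counts as "at least 90%".

```r
library(data.table)

x <- data.table(
  frequency = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
  names = c("ten", "nine", "eight", "seven", "six",
            "five", "four", "three", "two", "one")
)

total     <- sum(x$frequency)       # 55
threshold <- 0.9 * total            # 49.5
running   <- cumsum(x$frequency)    # 10 19 27 34 40 45 49 52 54 55

# Logical vector: FALSE until the running total reaches the threshold
crossed <- running >= threshold

# which.max() on a logical vector gives the position of the first TRUE
cutoff <- which.max(crossed)        # 8, because 52 is the first value >= 49.5

result <- x[seq_len(cutoff)]
```

Note that which.max() returns 1 when no element is TRUE, so a fraction above 1 would need separate handling.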

Answer 1 (score: 0)

Depending on the size of the data frame, there are two options:

1) Simple form:

    require(data.table)
    x <- data.table(frequency = c(10,9,8,7,6,5,4,3,2,1), names = c("ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "one"))
    # cumulative share of the total frequency, added by reference
    x[, cumfreq := cumsum(frequency) / sum(frequency)]
    print(x)
    # keep every row up to and including the one that crosses 90%
    x <- x[shift(cumfreq, fill = 0) < 0.9]
    print(x)
    x[, cumfreq := NULL] # drop the helper column once it is no longer needed

and 2) Elegant:

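require(data.table)
x <- data.table(frequency = c(10,9,8,7,6,5,4,3,2,1), names = c("ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "one"))
# Cutoff at the 10th percentile of the frequency values (1.9 for this data).
# This filters on the values themselves, not on the cumulative sum,
# so it keeps 9 rows here rather than the 8 shown in the question.
top <- quantile(x$frequency, probs = 0.1)
x <- subset(x, frequency > top)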
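
For comparison, here is a rough sketch of how a cumulative-share filter and a quantile filter differ on the example data. It assumes the goal is the one stated in the question: keep the largest frequencies until they account for at least 90% of the total. `top_share` is a hypothetical helper, not a data.table function. It sorts first, so the input does not have to be pre-sorted.

```r
library(data.table)

x <- data.table(
  frequency = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
  names = c("ten", "nine", "eight", "seven", "six",
            "five", "four", "three", "two", "one")
)

# Hypothetical helper (not part of data.table): keep the largest values of
# `col` until their running total reaches `share` of the column total.
top_share <- function(dt, col, share = 0.9) {
  ord     <- order(dt[[col]], decreasing = TRUE)  # largest values first
  sorted  <- dt[ord]                              # reordered copy; dt is untouched
  cumfreq <- cumsum(sorted[[col]]) / sum(sorted[[col]])
  # Keep a row while the share accumulated before it is below `share`,
  # so the row that crosses the threshold is included.
  keep <- shift(cumfreq, fill = 0) < share
  sorted[keep]
}

set.seed(1)
shuffled <- x[sample(.N)]                         # row order no longer matters

by_cumshare <- top_share(shuffled, "frequency", 0.9)

# Quantile filter: the cutoff is the 10th percentile of the values themselves
top <- quantile(shuffled$frequency, probs = 0.1)
by_quantile <- shuffled[frequency > top]

nrow(by_cumshare)   # 8
nrow(by_quantile)   # 9
```

The two results agree only by coincidence for some data. The quantile filter drops a fixed fraction of rows, while the cumulative-share filter drops however many rows make up the smallest 10% of the total.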