Question

For example, I have a large data.table:

library(data.table)
n <- 7
dt <- data.table(id_1 = sample(1:10^(n-1), 10^n, replace = TRUE), other = sample(letters[1:20], 10^n, replace = TRUE), val = rnorm(10^n, mean = 10^4, sd = 1000))

> structure(dt)
            id_1 other       val
       1: 914718     o  9623.078
       2: 695164     f 10323.943
       3:  53186     h 10930.825
       4: 496575     p  9964.064
       5: 474733     l 10759.779
      ---
 9999996: 650001     p  9653.125
 9999997: 225775     i  8945.636
 9999998: 372827     d  8947.095
 9999999: 268678     e  8371.433
10000000: 730810     i 10150.311

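I want to create a data.table that has only one row for each value of the indicator id_1, namely the row with the maximum value in the val column.

The following code seems to work: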

dt[, .SD[which.max(val)], by = .(id_1)]

However, it is very slow for large tables. Is there a faster way?

Answer 1

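I'm not sure how to do this in R, but what I do is read the file line by line and then put the matching lines into a data frame. This is very fast and takes only a moment for a 100 MB text file.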

import pandas as pd

filename = "C:/Users/xyz/Downloads/123456789.012-01-433.txt"

with open(filename, 'r') as f:
    sample = []  # rows that pass the filter
    for line in f:
        tag = line[:45].split('|')[5]  # optional filter condition; you don't need this
        if tag == 'KV-C901':
            sample.append(line.split('|'))  # keep the split fields of each matching line

print('arrays are appended and ready to create a dataframe out of an array')
df = pd.DataFrame(sample)  # build the dataframe from the collected rows

Answer 2

Technically, this is a duplicate of this question, but the answers there don't really explain what is going on, so here goes:

dt[dt[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max]

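The inner expression finds, for each group defined by id_1, the row indices of the maximum value and simply returns those indices so that they can be used to subset dt.

However, I'm surprised I didn't find an answer suggesting this: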
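To make the role of .I concrete, here is a minimal sketch on a toy table. The table `toy` and the names `idx_all`, `idx_one`, `res_all`, and `res_one` are made up for illustration and are not part of the question. Note that `val == max(val)` keeps every tied row, so a group with a tie yields more than one row; if exactly one row per id_1 is wanted, `which.max(val)` is one way to keep only the first maximum.

```r
library(data.table)

# Toy table (hypothetical data, not from the question); group 2 has a tie at val = 7
toy <- data.table(
  id_1 = c(1L, 1L, 2L, 2L, 2L, 3L),
  val  = c(5, 9, 4, 7, 7, 1)
)

# .I holds row numbers of the whole table, evaluated within each group,
# so the inner query returns positions that can index toy directly
idx_all <- toy[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max

# Variant: which.max() returns only the first maximum of each group
idx_one <- toy[, .I[which.max(val)], by = "id_1"]$V1

res_all <- toy[idx_all]  # includes both tied rows of group 2
res_one <- toy[idx_one]  # exactly one row per id_1
```

On this toy input, `res_all` has four rows because of the tie in group 2, while `res_one` has three.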

setkey(dt, id_1, val)[, .SD[.N], by = "id_1"]

It seems to be just as fast on my machine, but it requires the rows to be sorted.
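A small sketch of why the keyed version works, again on a made-up table (`toy`, `res_key`, and `res_unique` are illustrative names). setkey() sorts the table by reference, so the original row order is lost; once rows are ordered by id_1 and then val, the last row of each group is the one with the largest val. The `unique(..., fromLast = TRUE)` line is an alternative I am assuming is acceptable here, not something from the answer above, and I have not timed it.

```r
library(data.table)

# Toy table (hypothetical data, not from the question)
toy <- data.table(
  id_1  = c(2L, 1L, 2L, 1L, 3L),
  other = c("a", "b", "c", "d", "e"),
  val   = c(4, 9, 7, 5, 1)
)

# Sort by reference: ascending id_1, then ascending val within each id_1
setkey(toy, id_1, val)

# After sorting, row .N (the last row) of each group holds the group's maximum val
res_key <- toy[, .SD[.N], by = "id_1"]

# Assumed alternative on the same sorted table: keep the last row per id_1
res_unique <- unique(toy, by = "id_1", fromLast = TRUE)
```

Both results contain one row per id_1. Unlike the `val == max(val)` approach, ties are resolved by whichever tied row ends up last after sorting.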

Fast way to get the row with the maximum value for each indicator in a large data table
