Question

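My text file looks like the following and is about 6 GB. I want to keep every line up to and including the #CHROM line unchanged, but put all the lines below the #CHROM line into random order. Is there a memory-efficient way to do this?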

##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##contig=<ID=chrM,length=16571,assembly=hg19>
##reference=file:///dmf/
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr1    14165   .       A       G       220.12  VQSRTrancheSNP99.90to10
chr1    14248   .       T       G       547.33  VQSRTrancheSNP99.90to10
chr1    14354   .       C       A       2942.62 VQSRTrancheSNP99.90to10
chr1    14374   .       A       G       17.90   VQSRTrancheSNP99.90to10

The result I want looks like this:

##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##contig=<ID=chrM,length=16571,assembly=hg19>
##reference=file:///dmf/
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr1    14354   .       C       A       2942.62 VQSRTrancheSNP99.90to10
chr1    14248   .       T       G       547.33  VQSRTrancheSNP99.90to10
chr1    14374   .       A       G       17.90   VQSRTrancheSNP99.90to10
chr1    14165   .       A       G       220.12  VQSRTrancheSNP99.90to10

Answer 1

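I would split the file at the line you describe, run shuf on the second chunk, and then put the two pieces back together. I can't think of a memory-efficient approach that avoids the split.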
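A minimal sketch of that split, shuffle, and recombine idea, without writing intermediate files. The function name `shuffle_after_chrom` is made up for illustration, and the sketch assumes the file contains a line starting with `#CHROM`. As far as I know, shuf still holds its whole input in memory, so this keeps the command simple but does not by itself solve the memory problem.

```shell
# Hypothetical helper: print everything up to and including the #CHROM
# line unchanged, then print the remaining lines in random order.
shuffle_after_chrom() {
    # Line number of the first line starting with #CHROM
    # (assumption: such a line exists in the file).
    n=$(grep -n -m 1 '^#CHROM' "$1" | cut -d: -f1)
    head -n "$n" "$1"                      # header block, untouched
    tail -n "+$((n + 1))" "$1" | shuf      # body, shuffled
}
```

Usage would be something like `shuffle_after_chrom input.vcf > shuffled.vcf`, where both file names are placeholders.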

Answer 2

Here is one in awk:

awk -v seed="$RANDOM" '            # pass a random seed in for srand()
BEGIN {
    srand(seed)
}
/^#/ {                             # header lines (starting with #): print at once,
    print                          # no need to hold them in memory; this could be
    next                           # smarter, but that is not the point here
}
{
    r=rand()                        # random value used as the hash key
    a[r]=a[r] (a[r]==""?"":ORS) $0  # store the record; append on a key collision
}
END {
    for(i in a)                     # print in the traversal order of the awk implementation;
        print a[i]                  # the randomness comes from the random keys
}' file

Sample output:

##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##contig=<ID=chrM,length=16571,assembly=hg19>
##reference=file:///dmf/
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr1    14165   .       A       G       220.12  VQSRTrancheSNP99.90to10
chr1    14354   .       C       A       2942.62 VQSRTrancheSNP99.90to10
chr1    14248   .       T       G       547.33  VQSRTrancheSNP99.90to10
chr1    14374   .       A       G       17.90   VQSRTrancheSNP99.90to10

It loads every record that does not start with # into memory. If you try it, please let us know how large the memory footprint gets.

Update:

Here is another one, a small modification of the script above:

BEGIN {
    srand(seed)
}
/^#/ {
    print
    next
}
{
    r=rand()
    a[r]=a[r] (a[r]==""?"":ORS) $0
}
NR>1000 {             # once more than 1000 input lines have been read,
    for(i in a) {     # take whichever entry the traversal yields first,
        print a[i]    # print that "random" entry
        delete a[i]   # and delete it from the hash,
        break         # so only about 1000 records stay in memory
    }
}
END {
    for(i in a)
        print a[i]
}

Because NR also counts the records that start with #, 1000 is not the exact number of records held in the hash. Pick whatever value you like.
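If the exact buffer size matters, one option is to count only the data records instead of relying on NR. The following is a sketch under that assumption, not the original script: the wrapper `bounded_shuffle`, the counter `n`, and the `max` variable are names introduced here. It inherits the limits of the buffered approach: a line can never be printed much more than `max` positions earlier than it appeared in the input, and the order of `for (i in a)` is implementation-defined, so the result is a local shuffle rather than a uniform one.

```shell
# Hypothetical wrapper: bounded_shuffle FILE [MAX]
# Keeps roughly MAX data records in memory (default 1000).
bounded_shuffle() {
    # $RANDOM is a bash feature; fall back to the clock in other shells.
    awk -v seed="${RANDOM:-$(date +%s)}" -v max="${2:-1000}" '
    BEGIN { srand(seed) }
    /^#/  { print; next }                       # header lines pass straight through
    {
        r = rand()                              # random hash key
        a[r] = a[r] (a[r] == "" ? "" : ORS) $0  # append on a key collision
        if (++n > max) {                        # n counts data records only
            for (i in a) {                      # emit one buffered entry
                print a[i]
                delete a[i]
                break
            }
        }
    }
    END { for (i in a) print a[i] }             # flush what is left
    ' "$1"
}
```

For example, `bounded_shuffle input.vcf 100000 > shuffled.vcf` would hold about 100000 records at a time; the file names are placeholders.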

Here is sample output with NR>10 and seq 1 20:

$ seq 1 20 | awk -v seed=$RANDOM -f script.awk  
3
9
13
2
1
16
17
14
10
20
15
7
19
5
18
12
11
6
4
8

Answer 3

Since you are on Linux, you may want to use GNU sort -R to do the randomizing.

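GNU sort automatically falls back to disk space instead of RAM when it needs to, so it can sort or randomize hundreds of GB of data on a system with far less RAM.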
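A hedged sketch of how this could be applied to the file in the question. Running sort -R on the whole file would shuffle the header lines too, so the sketch prints them first and randomizes only the rest. The function name `shuffle_with_sort` and the optional temp-directory argument are illustrative; `-T` is the GNU sort option for choosing where temporary files go. Two assumptions to keep in mind: no data line starts with `#`, and since sort -R orders lines by a random hash of the key, identical lines end up next to each other instead of being scattered.

```shell
# Hypothetical helper: shuffle_with_sort FILE [TMPDIR]
shuffle_with_sort() {
    grep '^#' "$1"                                  # header lines, original order
    grep -v '^#' "$1" | sort -R -T "${2:-/tmp}"     # data lines, random order
}
```

Usage would look like `shuffle_with_sort input.vcf /path/with/space > shuffled.vcf`, with the paths replaced by your own.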
