Question

What is the fastest way to split a very large file and write the pieces to disk?

For example, if I have data like this:

chr    a_val    b_val   a_idx
2      1355     25d     abd
2      1785     25d     abd
2      1825     36g     ahj
3      1125     25d     abd
3      1568     25d     aky
3      2398     g67     abd
3      1125     25d     afd
3      1525     25d     abd
3 ....................
4 ..........
4 ........

I want to split it by the "chr" value.

I am thinking of applying a pandas method for this.

pandas is quite fast. But are there any other Unix, Linux, or Python based methods that can do this as fast as possible?

Thanks,

Answer 1

A one-liner Python approach using a list comprehension (my_df is the DataFrame holding the data):

[group.to_csv('data_' + str(index) + '.txt', sep='\t', header=True, index=False) for index, group in my_df.groupby('chr')]

Answer 2

Using awk, and expecting the data to be sorted on the chr column:

$ awk '
NR==1 {                       # store the header
    h=$0                      # to var h
    next
}
{
    if(p!=$1) {               # when chr changes
        close(f)              # close previous output file
        p=$1                  # new chr
        f="data_" p ".txt"    # new chr, new output file name
        $0=h ORS $0           # add header
    }
    print > f                 # output record to file
}' file
$ cat data_2.txt              # sample output
chr    a_val    b_val   a_idx
2      1355     25d     abd
2      1785     25d     abd
2      1825     36g     ahj
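
For comparison, the same sorted-input idea could be sketched in Python with only the standard library. This is a rough sketch and not benchmarked against awk: the function name split_sorted and the out_dir parameter are made up for illustration, and it assumes one header line, whitespace-separated columns with chr first, and rows already grouped by chr.

```python
import itertools
import os


def split_sorted(path, out_dir="."):
    """Write one data_<chr>.txt per run of adjacent rows sharing the first column.

    Assumes the input is already sorted (or at least grouped) by that column.
    """
    written = []
    with open(path) as src:
        header = next(src)  # first line is the header
        rows = (line for line in src if line.strip())  # skip blank lines
        # groupby only merges adjacent rows with equal keys, hence the sort requirement
        for key, group in itertools.groupby(rows, key=lambda line: line.split(None, 1)[0]):
            out_path = os.path.join(out_dir, f"data_{key}.txt")
            with open(out_path, "w") as out:  # only one output file open at a time
                out.write(header)
                out.writelines(group)
            written.append(out_path)
    return written
```

Calling split_sorted("file") would write data_2.txt, data_3.txt and so on into the current directory. Like the awk version, it streams the input and never holds more than one line in memory.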

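If the file is not sorted, the script above writes the header again every time a chr value reappears, so the output will be wrong. In that case you could: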

$ awk '                    # commented only the modified parts
NR==1 {
    h=$0
    next
}
{
    if(p!=$1) {
        close(f)
        p=$1
        f="data_" p ".txt"
        if((p in a)==0) {  # if current chr has not been seen before, i.e. new file
            $0=h ORS $0    # write the header
            a[p]           # hash the chr to a
        }
    }
    print >> f             # append to the file
}' file
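
The unsorted case can be sketched in Python along the same lines: remember which chr values already have a file, and append when one shows up again. Again only a sketch under assumptions: split_unsorted, out_dir and max_open are hypothetical names, the input has one header line and whitespace-separated columns with chr first. Unlike the awk >> version, it overwrites an output file left over from an earlier run the first time it sees a chr.

```python
import os


def split_unsorted(path, out_dir=".", max_open=100):
    """Split rows by their first column without requiring sorted input."""
    seen = set()   # chr values whose output file already has a header
    handles = {}   # chr value -> currently open output file
    try:
        with open(path) as src:
            header = next(src)
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                key = line.split(None, 1)[0]
                out = handles.get(key)
                if out is None:
                    if len(handles) >= max_open:
                        # stay below the OS limit on open files
                        for h in handles.values():
                            h.close()
                        handles.clear()
                    # first sighting: start a fresh file; later sightings: append
                    mode = "a" if key in seen else "w"
                    out = open(os.path.join(out_dir, f"data_{key}.txt"), mode)
                    handles[key] = out
                    if key not in seen:
                        out.write(header)
                        seen.add(key)
                out.write(line)
    finally:
        for h in handles.values():
            h.close()
    return sorted(seen)
```

Keeping handles open avoids an open/close per chr change, which matters when the input jumps between groups often. max_open is there because a file with very many distinct chr values could otherwise hit the per-process file limit.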

Answer 3

A Unix / Linux approach:

head -1 my_file.txt && tail -n +2 my_file.txt | sort -n

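head and tail are used here so that the header line of my_file.txt is kept out of the sort and only the remaining rows are sorted.

The -n option of sort sorts numerically.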
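
For readers who would rather stay in Python, the header-preserving numeric sort could be sketched as below. The function name sort_keep_header is made up for illustration. It assumes every data row starts with a number and that the whole file fits in memory, which may not hold for the very large file in the question. Ties keep their input order here, which is not necessarily how sort breaks them.

```python
def sort_keep_header(lines):
    """Return the header line followed by the other lines, sorted numerically by first column."""
    lines = list(lines)
    if not lines:
        return []
    header, body = lines[0], lines[1:]
    # float() fails on a non-numeric first column; sort -n would treat that as 0 instead
    body.sort(key=lambda line: float(line.split(None, 1)[0]))
    return [header] + body
```

The result is sorted on chr, so it can then be fed to any splitter that expects sorted input.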

Fastest way to split a file by unique groups