I have a large text file that looks like this:
#RefName Pos Coverage
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 0 0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 1 0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 2 1
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 3 0
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 4 0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 5 0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 6 101
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 7 10
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 8 0
The first line is a header and can be ignored or dropped. I have two separate goals:
1) Extract every line whose last-column value is not 0. 2) Group by the first column and, in the grouped file, drop the second column and sum the last column.
I know how to do these operations in pandas, but the file is 10 GB, so loading it into pandas would itself be painful.
Is there a clean way to do this, say with bash or awk?
Thanks!
Answer 0 (score: 0)
A simple approach in vanilla Python is to read the file line by line and process it according to the hard-coded format:
sum_groups = 0   # running total of the last (coverage) column
with open('groups_file.txt', 'w') as groups_file:
    with open('large_text_file.txt', 'r') as large_file:
        next(large_file)                       # skip the header line
        for line in large_file:
            line_items = line.split()          # split on any whitespace
            if int(line_items[-1]) == 0:       # ignore the line if the last value is 0
                continue
            sum_groups += int(line_items[-1])  # add the last (coverage) column to the sum
            # drop the second-to-last (Pos) column, then write the line out
            line_to_write = ' '.join(line_items[:-2] + line_items[-1:]) + '\n'
            groups_file.write(line_to_write)
Python's file handling does not read the whole file into memory at once (we read one line at a time, and as we read the next line the previous one gets garbage-collected), so this should not use much memory unless the groups themselves are very large. Writing works similarly, IIRC: rather than appending results to a list and writing at the end, you can simply open the output file up front and write directly from the input file to the output file, saving even more memory.
This is of course slower than batch-processing the whole file, but trading space for speed has always been a fundamental compromise in computing.
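Note that the running total above is a single global sum, not the per-group sums the question asks for. A per-group version can keep a dict keyed on the first column, at the cost of memory proportional to the number of distinct keys rather than the file size. A minimal sketch (the function name and the toy data are my own, not from the question):

```python
from collections import defaultdict

def sum_coverage_by_ref(lines):
    """Stream over data lines (header already skipped) and return a dict
    mapping the first column to the sum of the last column, skipping
    lines whose last value is 0."""
    totals = defaultdict(int)
    for line in lines:
        items = line.split()
        if not items or int(items[-1]) == 0:  # goal 1: drop zero-coverage lines
            continue
        totals[items[0]] += int(items[-1])    # goal 2: per-key running sum
    return totals

# tiny stand-in for the real file's data lines
totals = sum_coverage_by_ref(['a 0 0', 'a 1 3', 'b 2 5', 'a 3 2'])
print(totals)  # defaultdict(<class 'int'>, {'a': 5, 'b': 5})
```

On the real file you would pass an open file object after calling `next(f)` once to skip the header; the whole file is still read line by line.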
Answer 1 (score: 0)
$ awk 'NR>1 && $NF {a[$1]+=$NF}
END {for(k in a) print k, a[k]}' file
lcl|LGDX01000053.1_cds_KOV95325.1_1 10
lcl|LGDX01000053.1_cds_KOV95324.1_1 101
lcl|LGDX01000053.1_cds_KOV95322.1_1 1
Since the other columns are not matched, there is no guarantee they are all identical, so summarizing the data this way keeps only the key and the summed value.
Explanation
Look up an awk tutorial for the syntax basics behind this script:
NR>1 && $NF
skip the header (NR==1) and lines whose last field is zero
{a[$1]+=$NF}
sum the last field, using the first field as the key
END
after all input has been read
{for(k in a) print k, a[k]}
print all key–value pairs
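Goal 1 on its own (keeping the non-zero lines, header dropped) is a one-line filter in the same spirit; here a tiny stand-in file is created first, since `file` and `nonzero.txt` are placeholder names, not from the question:

```shell
# create a tiny stand-in input file (header + data lines)
printf '#RefName Pos Coverage\na 0 0\na 1 3\nb 2 0\nb 3 5\n' > file
# keep data lines whose last field is non-zero
awk 'NR>1 && $NF != 0' file > nonzero.txt
cat nonzero.txt
```

This streams through the file once, so it handles a 10 GB input without loading it into memory.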