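I have been working on a Python script that combines some information into BED format. That means I am working with features on a genome: my first column is the scaffold name (string), the second is the start position on that scaffold (integer), the third is the stop position (integer), and the remaining columns hold other information that is not relevant to my question. My problem is that my output is unsorted.
I know I can sort the file with this bash command:
$ sort -k1,1 -k2,2n -k3,3n infile > outfile
But out of interest, I would like to know whether there is a way to do this within Python. So far I have only seen list-based sorts that are either lexicographic or numeric, not a combination of the two. So, do you have any ideas?
A snippet of my data (I want to sort by columns 1, 2 and 3, in that order):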
Scf_3R 8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
Scf_3R 8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
Scf_3L 16209309 16236428 FBgn0000017 FBgn0184183 -1.19105585707
Scf_2L 10630469 10632308 FBgn0000018 FBgn0193617 0.073153454539
Scf_3R 12087670 12124207 FBgn0000024 FBgn0022516 -0.023946795475
Scf_X 14395665 14422243 FBgn0000028 FBgn0187465 0.00300558969397
Scf_3R 25163062 25165316 FBgn0000032 FBgn0189058 0.530118698187
Scf_3R 19757441 19808894 FBgn0000036 FBgn0189822 -0.282508464261
Answer 0 (score: 2)
Load the data, sort it with sorted, and then write it to a new file.
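import csv

# Load data (`filename` is the path to the input file)
lists = list()
with open(filename, 'r') as f:
    for line in f:
        lists.append(line.rstrip().split())

# Sort data: column 1 as text, columns 2 and 3 as integers
results = sorted(lists, key=lambda x: (x[0], int(x[1]), int(x[2])))

# Write to a file (reusing `filename` here overwrites the input;
# use a different path to keep the original)
with open(filename, 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(results)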
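To illustrate why the key converts columns 2 and 3 to integers, here is a minimal sketch on three of the question's rows held in memory. The names `rows` and `bed_key` are made up for this illustration and do not appear in the answer.

```python
# Three rows from the question, truncated to the first three columns.
rows = [
    "Scf_3R 8599253 8621866".split(),
    "Scf_3R 12087670 12124207".split(),
    "Scf_2L 10630469 10632308".split(),
]


def bed_key(fields):
    """Text order on column 1, numeric order on columns 2 and 3."""
    return fields[0], int(fields[1]), int(fields[2])


numeric = sorted(rows, key=bed_key)  # mixed text/integer ordering
lexical = sorted(rows)               # plain string comparison on every column

print([r[1] for r in numeric])  # ['10630469', '8599253', '12087670']
print([r[1] for r in lexical])  # ['10630469', '12087670', '8599253']
```

Without the conversion, '12087670' sorts before '8599253' because the strings are compared character by character.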
Answer 1 (score: 2)
To sort by your own criteria, just pass the corresponding key function:
with open('infile', 'rb') as file:
    lines = file.readlines()


def sort_key(line):
    fields = line.split()
    try:
        return fields[0], int(fields[1]), int(fields[2])
    except (IndexError, ValueError):
        return ()  # sort invalid lines together


lines.sort(key=sort_key)
with open('outfile', 'wb') as file:
    file.writelines(lines)
It assumes that the input file ends with a newline (append one if necessary).
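The code sorts the text data by byte values, which is fine if the first column is ASCII. If it is not, open the files in text mode (use io.open() on Python 2) to sort by Unicode code point values instead. The result of the sort command in the shell may depend on the locale. You could use a PyICU collator in Python.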
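A minimal sketch of the text-mode variant, which also appends the final newline if it is missing. The function name `sort_bed_text` and the UTF-8 default are assumptions made for this illustration; adjust the encoding to match your file.

```python
import io


def sort_bed_text(in_path, out_path, encoding="utf-8"):
    """Sort lines by (text, int, int), comparing column 1 by code points."""
    with io.open(in_path, "r", encoding=encoding) as src:
        lines = src.readlines()

    # Make sure the last line is newline-terminated so that it does not
    # get glued to another line after sorting.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"

    def sort_key(line):
        fields = line.split()
        try:
            return fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            return ()  # invalid lines sort together, ahead of valid ones

    lines.sort(key=sort_key)
    with io.open(out_path, "w", encoding=encoding) as dst:
        dst.writelines(lines)
```

Calling `sort_bed_text('infile', 'outfile')` mirrors the byte-mode code above. It does not do locale-aware collation; that would still need something like the PyICU collator mentioned earlier.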
If you need to sort files that do not fit in memory, see Sorting text file by using Python
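For orientation, one common approach to that problem is an external merge sort: sort fixed-size chunks in memory, spill each to a temporary file, then merge the sorted chunks lazily. The following is only a rough sketch of that general technique, not the code from the linked answer. The names `bed_key`, `external_sort` and `chunk_size` are invented here, every line is assumed to be valid and newline-terminated, and `heapq.merge` needs Python 3.5+ for its `key` argument.

```python
import heapq
import itertools
import tempfile


def bed_key(line):
    """Text order on column 1, numeric order on columns 2 and 3."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2])


def _sorted_chunks(lines, chunk_size):
    """Yield temporary files, each holding one sorted chunk of lines."""
    lines = iter(lines)
    while True:
        chunk = sorted(itertools.islice(lines, chunk_size), key=bed_key)
        if not chunk:
            return
        tmp = tempfile.TemporaryFile(mode="w+")
        tmp.writelines(chunk)
        tmp.seek(0)  # rewind so the merge step can read the chunk back
        yield tmp


def external_sort(in_path, out_path, chunk_size=100000):
    """Sort in_path into out_path, keeping about chunk_size lines in memory."""
    with open(in_path) as src:
        chunks = list(_sorted_chunks(src, chunk_size))
    try:
        with open(out_path, "w") as dst:
            # heapq.merge reads only one line per chunk at a time.
            dst.writelines(heapq.merge(*chunks, key=bed_key))
    finally:
        for tmp in chunks:
            tmp.close()
```

All chunk files stay open during the merge, so a very large input with a small `chunk_size` could run into the operating system's limit on open files.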
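Answer 2 (score: 0)
The solution from @sparkandshine looks short and to the point for the specific sort pattern. The one offered by @j-f-sebastian looks great to me: terse, and with important hints and links on internationalization and on strategies for sorting files that do not fit in memory.
Maybe the following, more explicit show case offers additional useful information for the OP, or for people with similar tasks, to adapt to their needs. Please see the comments in the mostly PEP 8 compliant code: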
#! /usr/bin/env python
"""Provide a show case for hierarchical sort, that offers flexible
hierarchical lexical, numeric column sort mixes at runtime.
Hopefully this draft solution offers ideas for helping migrate
the sort shell level operation into a pythonic solution - YMMV."""
from __future__ import print_function
from functools import partial  # We use this to tailor the key function


def text_in_lines_gen(text_in_lines):
    """Mock generator simulating a line source for the data."""
    for line in text_in_lines.split('\n'):
        if line:
            yield line.split()


def sort_hier_gen(iterable_lines, hier_sort_spec):
    """Given iterator of text lines, sort all lines based on
    sort specification in hier_sort_spec.

    Every entry in hier_sort_spec is expected to be a pair with first value
    integer for index in columns of text blocks lines and second entry
    type of sorting in ('int', 'float') numeric or any other for text
    (lexical) ordering regime."""
    num_codes = ('int', 'float')
    converter_map = dict(zip(num_codes, (int, float)))

    # Extract facts from sort spec, prepare processing:
    key_ordered = tuple(k for k, _ in hier_sort_spec)

    # Prepare key function: Step 1 ...
    def _key_in(selected, r):
        """Inject the indexing into the key at sort time
        via partial application, as key function in sort
        has single argument only."""
        return tuple(r[k] for k in selected)

    _key = partial(_key_in, key_ordered)  # ... step 2

    convert_these_by = {}
    for k, t in hier_sort_spec:
        if t in num_codes:
            convert_these_by[k] = converter_map[t]

    if not convert_these_by:  # early out
        for row in sorted(iterable_lines, key=_key):
            yield row
    else:
        def flow_converter(row_iter, converter_map):
            """Row based converter - Don't block the flow ;-)."""
            for row in row_iter:
                for k, convert in converter_map.items():
                    row[k] = convert(row[k])
                yield row

        for row in sorted(flow_converter(iterable_lines,
                                         convert_these_by), key=_key):
            yield row


def main():
    """Drive the hierarchical text-int-int sort."""
    data_1 = """Scf_3R 8599253 8621866 FBgn0000014 FBgn0191744 -0.097558026153
Scf_3R 8497493 8503049 FBgn0000015 FBgn0025043 0.437973284047
Scf_3L 16209309 16236428 FBgn0000017 FBgn0184183 -1.19105585707
Scf_2L 10630469 10632308 FBgn0000018 FBgn0193617 0.073153454539
Scf_3R 12087670 12124207 FBgn0000024 FBgn0022516 -0.023946795475
Scf_X 14395665 14422243 FBgn0000028 FBgn0187465 0.00300558969397
Scf_3R 25163062 25165316 FBgn0000032 FBgn0189058 0.530118698187
Scf_3R 19757441 19808894 FBgn0000036 FBgn0189822 -0.282508464261"""

    # Build a second, synthetic data set in descending order
    bar = []
    x = 0
    for a in range(3, 0, -1):
        for b in range(3, 0, -1):
            for c in range(3, 0, -1):
                x += 1
                bar.append('a_%d %d %0.1f %d' % (a, b, c * 1.1, x))
    data_2 = '\n'.join(bar)

    hier_sort_spec = ((0, 't'), (1, 'int'), (2, 'int'))
    print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
        print(sorted_row)

    hier_sort_spec = ((0, 't'), (1, None), (2, False))
    print("# Test data set 1 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_1), hier_sort_spec):
        print(sorted_row)

    hier_sort_spec = ((0, 't'), (2, 'float'), (1, 'int'))
    print("# Test data set 2 and sort spec={0}:".format(hier_sort_spec))
    for sorted_row in sort_hier_gen(text_in_lines_gen(data_2), hier_sort_spec):
        print(sorted_row)


if __name__ == '__main__':
    main()
On my machine, the three test cases (including the question's sample data) yield:
First:
# Test data set 1 and sort spec=((0, 't'), (1, 'int'), (2, 'int')):
['Scf_2L', 10630469, 10632308, 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', 16209309, 16236428, 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', 8497493, 8503049, 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', 8599253, 8621866, 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_3R', 12087670, 12124207, 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', 19757441, 19808894, 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', 25163062, 25165316, 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_X', 14395665, 14422243, 'FBgn0000028', 'FBgn0187465', '0.00300558969397']
Second:
# Test data set 1 and sort spec=((0, 't'), (1, None), (2, False)):
['Scf_2L', '10630469', '10632308', 'FBgn0000018', 'FBgn0193617', '0.073153454539']
['Scf_3L', '16209309', '16236428', 'FBgn0000017', 'FBgn0184183', '-1.19105585707']
['Scf_3R', '12087670', '12124207', 'FBgn0000024', 'FBgn0022516', '-0.023946795475']
['Scf_3R', '19757441', '19808894', 'FBgn0000036', 'FBgn0189822', '-0.282508464261']
['Scf_3R', '25163062', '25165316', 'FBgn0000032', 'FBgn0189058', '0.530118698187']
['Scf_3R', '8497493', '8503049', 'FBgn0000015', 'FBgn0025043', '0.437973284047']
['Scf_3R', '8599253', '8621866', 'FBgn0000014', 'FBgn0191744', '-0.097558026153']
['Scf_X', '14395665', '14422243', 'FBgn0000028', 'FBgn0187465', '0.00300558969397']
Third:
# Test data set 2 and sort spec=((0, 't'), (2, 'float'), (1, 'int')):
['a_1', 1, 1.1, '27']
['a_1', 2, 1.1, '24']
['a_1', 3, 1.1, '21']
['a_1', 1, 2.2, '26']
['a_1', 2, 2.2, '23']
['a_1', 3, 2.2, '20']
['a_1', 1, 3.3, '25']
['a_1', 2, 3.3, '22']
['a_1', 3, 3.3, '19']
['a_2', 1, 1.1, '18']
['a_2', 2, 1.1, '15']
['a_2', 3, 1.1, '12']
['a_2', 1, 2.2, '17']
['a_2', 2, 2.2, '14']
['a_2', 3, 2.2, '11']
['a_2', 1, 3.3, '16']
['a_2', 2, 3.3, '13']
['a_2', 3, 3.3, '10']
['a_3', 1, 1.1, '9']
['a_3', 2, 1.1, '6']
['a_3', 3, 1.1, '3']
['a_3', 1, 2.2, '8']
['a_3', 2, 2.2, '5']
['a_3', 3, 2.2, '2']
['a_3', 1, 3.3, '7']
['a_3', 2, 3.3, '4']
['a_3', 3, 3.3, '1']
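Update: the code now mostly uses generators so that only one copy of the data is kept "around". The global sort needs that copy in memory anyway, but no further copies are required ;-)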
Also added functools.partial, since to me it is the quickest way to adapt the key function to flexible sort orders.
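As a small, stand-alone illustration of that idea, the sketch below binds the column selection into a two-argument function with functools.partial, which leaves the single-argument callable that sorted() expects. The names `key_by_columns`, `by_name_then_int` and `by_float`, and the sample rows, are made up for this example.

```python
from functools import partial


def key_by_columns(columns, row):
    """Build a sort key from the given column indexes, in priority order."""
    return tuple(row[c] for c in columns)


rows = [["b", 2, 1.5], ["a", 3, 0.5], ["a", 1, 2.5]]

# Fix the first argument; the result takes only `row`, as sorted() requires.
by_name_then_int = partial(key_by_columns, (0, 1))
by_float = partial(key_by_columns, (2,))

print(sorted(rows, key=by_name_then_int))
# [['a', 1, 2.5], ['a', 3, 0.5], ['b', 2, 1.5]]
print(sorted(rows, key=by_float))
# [['a', 3, 0.5], ['b', 2, 1.5], ['a', 1, 2.5]]
```

The same key function serves any column order chosen at runtime; only the tuple passed to partial changes.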
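The last update removed the remaining non-generator copy in the case where conversions are needed, by defining a local generator function for the row-based conversion. HTH.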