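Sorting by specimen ID with a Python script

Time: 2017-01-04 13:54:19

Tags: python sorting

I have a Python script that merges data files sharing the same format. It drops the duplicate headers and adds two blank lines after every three lines, except for the first instance, which is the first four lines including the header.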

import glob

read_files = glob.glob("*.txt")

header_saved = False
linecnt=0
with open("merged_data.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            header = next(infile)
            if not header_saved:
                outfile.write(header)
                header_saved = True
            for line in infile:
                outfile.write(line)
                linecnt=linecnt+1
                if (linecnt%3)==0:
                    outfile.write(b"\n\n")  # bytes literal, since the file is opened in binary mode

Example input file text (infile 1):

Specimen_ID Measured_by_initals Measure_date    Sex Beak_length Pronotal_width  Right_fore_femur_length Right_fore_femur_width  Left_fore_femur_length  Left_fore_femur_width   Right_hind_femur_length Right_hind_femur_width  Left_hind_femur_length  Left_hind_femur_width   Right_hind_femur_area   Left_hind_femur_area    Right_hind_tibia_width  Left_hind_tibia_width   Notes
a   1   30-Dec-16   M   4   4   4   4   4   4   4   4   4   4   4   4   4   4   
b   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4   beak bent
c   1   30-Dec-16   M   4   4   4   4   4   4   4   4   4   4   4   4   4   4   
d   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4   
e   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4   pronotum deformed
f   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4

Example input file text (infile 2):

Specimen_ID Measured_by_initals Measure_date    Sex Beak_length Pronotal_width  Right_fore_femur_length Right_fore_femur_width  Left_fore_femur_length  Left_fore_femur_width   Right_hind_femur_length Right_hind_femur_width  Left_hind_femur_length  Left_hind_femur_width   Right_hind_femur_area   Left_hind_femur_area    Right_hind_tibia_width  Left_hind_tibia_width   Notes
a   2   30-Dec-16   M   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
b   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
c   2   30-Dec-16   M   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
d   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
e   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 
f   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 

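I would now like to modify the script so that it sorts the output by Specimen_ID while keeping two blank lines after every three lines (that is, two blank lines should follow each unique group of rows). Any suggestions for sorting the rows? I have seen plenty on sorting multidimensional data or Python lists, but not much on 2D tables.

Also, I am running into some odd behavior: if I export my data from Excel as tab-delimited txt files, the script produces output containing only the contents of the first infile and nothing else. However, if I copy the example data from this site, paste it into txt files, and use those as infiles, I have no problem. Does anyone know why I am running into this?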

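2 Answers:

Answer 0: (score: 0)

I have changed the test data so that it is already listed line by line. This is roughly equivalent to what readlines() returns: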

data_1 = """
Specimen_ID Measured_by_initals Measure_date    Sex Beak_length Pronotal_width  Right_fore_femur_length Right_fore_femur_width  Left_fore_femur_length  Left_fore_femur_width   Right_hind_femur_length Right_hind_femur_width  Left_hind_femur_length  Left_hind_femur_width   Right_hind_femur_area   Left_hind_femur_area    Right_hind_tibia_width  Left_hind_tibia_width   Notes
a   1   30-Dec-16   M   4   4   4   4   4   4   4   4   4   4   4   4   4   4
b   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4   beak bent
c   1   30-Dec-16   M   4   4   4   4   4   4   4   4   4   4   4   4   4   4
d   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4
e   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4   pronotum deformed
f   1   30-Dec-16   F   4   4   4   4   4   4   4   4   4   4   4   4   4   4
""".split('\n')[1:-1]

data_2 = """
Specimen_ID Measured_by_initals Measure_date    Sex Beak_length Pronotal_width  Right_fore_femur_length Right_fore_femur_width  Left_fore_femur_length  Left_fore_femur_width   Right_hind_femur_length Right_hind_femur_width  Left_hind_femur_length  Left_hind_femur_width   Right_hind_femur_area   Left_hind_femur_area    Right_hind_tibia_width  Left_hind_tibia_width   Notes
a   2   30-Dec-16   M   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
b   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
c   2   30-Dec-16   M   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
d   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
e   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
f   2   30-Dec-16   F   4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
""".split('\n')[1:-1]

By reading in all the data before writing anything back, the program does not need to count lines:

headers = []
data = {}

# Go through the data for each file and sort by specimen id
for file_data in (data_1, data_2):
    headers.append(file_data[0])
    for line in file_data[1:]:
        # specimen id is first column of space separated data
        specimen_id = line.split(' ', 1)[0].strip()

        # store each line in a list per specimen id
        if specimen_id not in data:
            data[specimen_id] = []
        data[specimen_id].append(line)

# output the merged data: the header once, then each specimen's rows
# followed by two blank lines
with open("merged_data.txt", "w") as outfile:
    outfile.write(headers[0] + '\n')
    for specimen_id in sorted(data):
        for line in data[specimen_id]:
            outfile.write(line + '\n')
        outfile.write("\n\n")
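
The same grouping idea can be applied to files on disk rather than inline test data. The following is only a sketch under a few assumptions: Python 3, whitespace-delimited input where Specimen_ID is the first column, and a hypothetical helper name `merge_by_specimen` that is not part of the original answer.

```python
from collections import defaultdict


def merge_by_specimen(paths, out_path):
    """Merge delimited text files, grouping data rows by their first column.

    Hypothetical helper: the header of the first non-empty file is written
    once, then each specimen's rows are followed by two blank lines.
    """
    header = None
    rows = defaultdict(list)
    for path in paths:
        # Python 3 text mode with default newline handling accepts
        # \n, \r\n and \r line endings.
        with open(path, "r") as infile:
            lines = infile.read().splitlines()
        if not lines:
            continue
        if header is None:
            header = lines[0]
        for line in lines[1:]:
            if line.strip():
                # split() with no argument handles tabs and runs of spaces
                rows[line.split()[0]].append(line)

    with open(out_path, "w") as outfile:
        if header is not None:
            outfile.write(header + "\n")
        for specimen_id in sorted(rows):
            for line in rows[specimen_id]:
                outfile.write(line + "\n")
            outfile.write("\n\n")
```

It could be called as `merge_by_specimen(sorted(glob.glob("*.txt")), "merged_data.txt")` after `import glob`. Note that if the output file lives in the same directory and matches the same pattern, it would be picked up as an input on the next run.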

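Answer 1: (score: 0)

I would suggest using pandas to handle tabular data, since you can read the data easily with read_csv(), call sort_values(by='Specimen_ID'), and then iterate over the output to print the extra newlines.

Assuming the input files are tab-delimited, here is how to read and sort them with pandas: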

import pandas as pd
import glob
try:
    from io import StringIO
except ImportError:
    from StringIO import StringIO

dfs = []
for infile in glob.glob('*.txt'):
    # Infile can be a file path or an open file object
    df = pd.read_csv(infile, delimiter='\t')
    dfs.append(df)

df = pd.concat(dfs)     # Combine all the dataframes you loaded in.

df = df.sort_values(by='Specimen_ID')   # sort_values returns a new DataFrame

# Write this to an intermediate StringIO object before the next step.
o_s = StringIO()
df.to_csv(o_s, sep='\t', index=False)
o_s.seek(0)
lines = o_s.readlines()   # Get CSV as a list of lines.

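At this point, you want to dump them out. If there were no requirement for a blank line every 3 lines, you would just call df.to_csv('merged_text.csv', sep='\t', index=False) and be done (sep makes the output tab-delimited, and index=False is there because pandas adds a numerical index when it reads the data, which is meaningless and should not be written out). Instead, we have read the CSV into a list of lines so that we can iterate over them and write out the extra lines as needed: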

# Walk through the data lines three at a time, writing two blank lines
# after each group of three.
with open('merged_data.txt', 'w') as f:
    f.write(lines[0])    # Write the header line
    for ii in range(1, len(lines), 3):
        # Write three lines at a time after the header, then two blank lines
        f.writelines(lines[ii:ii + 3] + ['\n\n'])

If you don't want to use pandas, you can try the csv module:

import csv
import glob
from io import StringIO
from operator import itemgetter

lines_in = []
header_line = None
for infile in glob.glob('*.txt'):
    with open(infile, 'r', newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        first_line = next(reader)
        if header_line is None:
            header_line = first_line

        # Append all the remaining lines
        lines_in += list(reader)

# Making the assumption that Specimen_ID is always the first column
lines = sorted(lines_in, key=itemgetter(0))

# Write this out as a well-formatted CSV
o_s = StringIO()
writer = csv.writer(o_s, delimiter='\t', lineterminator='\n')
writer.writerow(header_line)
writer.writerows(lines)
o_s.seek(0)               # Rewind before reading the lines back
lines = o_s.readlines()

Once you have lines, you can write it to the output file with the same code I used above.
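
If some specimens have more or fewer than three measurements, counting lines in threes will put the breaks in the wrong place. One alternative, sketched here as an assumption rather than as part of the original answer, is to break on changes in the first column with itertools.groupby. The helper name `add_group_breaks` is hypothetical, and it assumes `lines` is non-empty, holds the header first, and has its data rows already sorted by Specimen_ID.

```python
from itertools import groupby


def add_group_breaks(lines, sep="\t", blank_lines=2):
    """Return a new list with blank lines appended after each ID group.

    `lines` is expected to hold the header first, followed by data rows
    (each ending in a newline) already sorted by their first column.
    """
    header, rows = lines[0], lines[1:]
    out = [header]
    # groupby yields one group per run of consecutive rows sharing a key
    for _specimen_id, group in groupby(rows, key=lambda row: row.split(sep, 1)[0]):
        out.extend(group)
        out.append("\n" * blank_lines)
    return out
```

The result can be passed straight to `f.writelines(...)` on an output file opened in text mode.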