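I have a Python script that merges data files sharing the same format. It drops only the duplicate headers and adds two blank lines after every three rows, except for the first group, which is the first four lines because it includes the header.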
import glob

read_files = glob.glob("*.txt")
header_saved = False
linecnt = 0
with open("merged_data.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            header = next(infile)
            if not header_saved:
                outfile.write(header)
                header_saved = True
            for line in infile:
                outfile.write(line)
                linecnt = linecnt + 1
                if (linecnt % 3) == 0:
                    outfile.write("\n\n")
Sample input file text (infile 1):
Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes
a 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4
b 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 beak bent
c 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4
d 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4
e 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 pronotum deformed
f 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Sample input file text (infile 2):
Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes
a 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
b 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
c 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
d 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
e 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
f 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
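I now want to modify the script so that it sorts the output by Specimen_ID while still keeping two blank lines after every three rows (i.e., there should be two blank lines after each unique Specimen_ID). Any suggestions for sorting the rows? I have seen plenty about sorting multi-dimensional data or Python lists, but not much about 2D tables.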
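I am also seeing some odd behavior: if I export my data from Excel as tab-delimited txt files, this script produces output that contains only the contents of the first infile and nothing from the others. If I instead copy and paste the sample data from this site into txt files and use those as the infiles, there is no problem. Does anyone know why this happens?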
Answer 0 (score: 0)
I have changed the test data so that it is already listed line by line. This is roughly equivalent to what readlines() returns:
data_1 = """
Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes
a 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4
b 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 beak bent
c 1 30-Dec-16 M 4 4 4 4 4 4 4 4 4 4 4 4 4 4
d 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4
e 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4 pronotum deformed
f 1 30-Dec-16 F 4 4 4 4 4 4 4 4 4 4 4 4 4 4
""".split('\n')[1:-1]
data_2 = """
Specimen_ID Measured_by_initals Measure_date Sex Beak_length Pronotal_width Right_fore_femur_length Right_fore_femur_width Left_fore_femur_length Left_fore_femur_width Right_hind_femur_length Right_hind_femur_width Left_hind_femur_length Left_hind_femur_width Right_hind_femur_area Left_hind_femur_area Right_hind_tibia_width Left_hind_tibia_width Notes
a 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
b 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
c 2 30-Dec-16 M 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
d 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
e 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
f 2 30-Dec-16 F 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1
""".split('\n')[1:-1]
By reading in all the data before writing anything back out, the program does not need to count lines:
headers = []
data = {}
# Go through the data for each file and group the lines by specimen id
for file_data in (data_1, data_2):
    headers.append(file_data[0])
    for line in file_data[1:]:
        # specimen id is the first column of the space-separated data
        specimen_id = line.split(' ', 1)[0].strip()
        # store each line in a list per specimen id
        if specimen_id not in data:
            data[specimen_id] = []
        data[specimen_id].append(line)
# output the merged data
with open("merged_data.txt", "w") as outfile:  # text mode, since we write str
    for specimen_id in sorted(data):
        # the header is repeated above each group; move this line above the
        # loop to write it only once
        outfile.write(headers[0] + '\n')
        for line in data[specimen_id]:
            outfile.write(line + '\n')
        outfile.write("\n\n")
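The same group-by-ID idea can be applied to files on disk rather than to in-memory test strings. The following is only a minimal sketch under a few assumptions that are not in the original answer: the function name merge_sorted is made up for illustration, the files are assumed to be tab-delimited with Specimen_ID in the first column, and the header is written once instead of above every group. It opens the inputs in text mode, so Python 3's universal-newline handling accepts \n, \r\n and \r line endings alike. Whether line endings are behind the Excel behavior described in the question is a guess, not something the thread confirms.

```python
from collections import defaultdict


def merge_sorted(paths, out_path, delimiter="\t"):
    """Merge same-format files, grouping data rows by their first column.

    Hypothetical helper: assumes every file starts with the same header line
    and that the first column holds the Specimen_ID.
    """
    header = None
    groups = defaultdict(list)
    for path in paths:
        # Text mode: \r and \r\n line endings are translated to \n on read.
        with open(path, "r") as infile:
            lines = infile.read().splitlines()
        if not lines:
            continue  # skip empty files
        if header is None:
            header = lines[0]  # keep only the first header seen
        for line in lines[1:]:
            if line.strip():  # ignore blank lines
                specimen_id = line.split(delimiter, 1)[0].strip()
                groups[specimen_id].append(line)
    with open(out_path, "w") as outfile:
        if header is not None:
            outfile.write(header + "\n")
        for specimen_id in sorted(groups):
            for line in groups[specimen_id]:
                outfile.write(line + "\n")
            outfile.write("\n\n")  # two blank lines after each specimen
```

It could be called as merge_sorted(glob.glob("*.txt"), "merged_data.txt") after import glob. Note that a merged_data.txt left over from an earlier run in the same directory also matches *.txt, so it is probably worth writing the output elsewhere or filtering it out of the list.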
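Answer 1 (score: 0)
I would suggest using pandas for tabular data like this, since you can easily read the data with read_csv(), then call sort_values(by='Specimen_ID'), and then iterate over the output to print the extra newlines.
Assuming these input files are tab-delimited, you can read and sort them with pandas as follows: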
import glob

import pandas as pd

try:
    from io import StringIO
except ImportError:
    from StringIO import StringIO

dfs = []
for infile in glob.glob('*.txt'):
    # infile can be a file path or an open file object
    df = pd.read_csv(infile, delimiter='\t')
    dfs.append(df)
df = pd.concat(dfs)  # Combine all the dataframes you loaded in.
df = df.sort_values(by='Specimen_ID')  # sort_values returns a new DataFrame
# Write this to an intermediate StringIO object before the next step.
o_s = StringIO()
df.to_csv(o_s, sep='\t', index=False)
o_s.seek(0)
lines = o_s.readlines()  # Get CSV as a list of lines.
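At this point you want to dump them out. If there were no requirement for a blank line every 3 lines, you could just call df.to_csv('merged_text.csv', sep='\t', index=False) and be done (sep makes it tab-delimited, and index=False because pandas adds a numeric index when it reads the data, which you don't want to write out since it is meaningless). Instead, we read it into a list of lines so that we can iterate over them and write out the extra lines as needed:
# This will go through lines 3 at a time and then append a blank "line"
# before writing it.
with open('merged_data.txt', 'w') as f:
    f.write(lines[0])  # Write the header line
    for start in range(1, len(lines), 3):
        # Write three lines at a time after the header, then an extra newline
        # (use '\n\n' for the two blank lines asked for in the question)
        f.writelines(lines[start:start + 3] + ['\n'])
If you don't want to use pandas, you can try the csv module: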
import csv
import glob
from io import StringIO  # on Python 2: from StringIO import StringIO
from operator import itemgetter

lines_in = []
header_line = None
for infile in glob.glob('*.txt'):
    with open(infile, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        first_line = next(reader)
        if header_line is None:
            header_line = first_line
        # Append all the non-empty rows
        lines_in += [row for row in reader if row]
# Making the assumption that Specimen_ID is always the first column
lines_in = sorted(lines_in, key=itemgetter(0))
# Write this out as a well-formatted CSV
o_s = StringIO()
writer = csv.writer(o_s, delimiter='\t', lineterminator='\n')
writer.writerow(header_line)
writer.writerows(lines_in)
o_s.seek(0)  # rewind before reading the lines back
lines = o_s.readlines()
Once you have lines, you can write it to the output file using the same code I used above.
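Writing three lines at a time only lines up with the specimen IDs when every ID has exactly three rows. If that cannot be guaranteed, the blank lines can be keyed to the ID itself. The sketch below is one possible way to do that with itertools.groupby; the helper name with_group_gaps is hypothetical, and it assumes lines is already sorted, starts with the header, uses tabs as the delimiter, and has a newline at the end of each entry (as readlines() returns).

```python
from itertools import groupby


def with_group_gaps(lines, delimiter="\t", gap="\n\n"):
    """Yield the header, then the data lines, with `gap` after each run of
    consecutive lines that share the same first column.

    Hypothetical helper: expects `lines` to be sorted by the first column
    already, because groupby only merges adjacent equal keys.
    """
    header, rows = lines[0], lines[1:]
    yield header

    def first_column(line):
        return line.split(delimiter, 1)[0]

    for _, group in groupby(rows, key=first_column):
        for line in group:
            yield line
        yield gap  # two blank lines after each Specimen_ID group
```

With this in place, the output step becomes f.writelines(with_group_gaps(lines)) inside the same with open('merged_data.txt', 'w') as f: block.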