Question

I want to read a TSV file (more than 1M rows) and write another TSV file that contains exactly the same data, but with the columns rearranged.

For example,

Original TSV file:

A   B . . . . .H
a1  b1.. . . . h1
a2  b2. . . . .h2
a3  b3. . . . .h3
.   .. . . . . . so on.

(The first row is the header.)

I know how to create, read, and write files, but I don't know how to rearrange the columns.

file_location = 'abc.tsv'
output_filename = 'sample.tsv'


def main():
    file_reader = open(file_location,'r')
    new_file = open(output_filename,'w')

    for rows in file_reader:
        try:
            rows = rows.strip().split('\t')


        except Exception as e:
            print('Error in reading file: %s' % e)

    file_reader.close()
    new_file.close()


if __name__ == '__main__':
    main()

Expected output:

D   G . . . . B
d1  g1. . . . b1
d2  g2. . . . b2
d3  g3. . . . b3
d4  g4. . . . b4
.   . . . . . .
.  .  . . . . . so on.

Any ideas are appreciated. Thanks.

Answer 1

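As I mentioned in a comment, you can use the csv module to do this. It should also be fairly fast: there is no explicit loop over the rows or fields of the file, and the csv module is written in C.

For example: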

import csv


file_location = 'abc.tsv'
output_filename = 'sample.tsv'
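# infields must name every column of the input file, in file order.
infields =  'A', 'B', 'C', 'D', 'G', 'H'
# outfields is the desired output order; fields not listed here are dropped.
outfields = 'D', 'G', 'A', 'H', 'C', 'B'


def main():
    with open(file_location, 'r', newline='') as inp, \
         open(output_filename, 'w', newline='') as outp:

        reader = csv.DictReader(inp, fieldnames=infields, delimiter='\t')
        writer = csv.DictWriter(outp, fieldnames=outfields, delimiter='\t',
                                extrasaction='ignore')

        # Because fieldnames is passed explicitly, the header line is read as
        # an ordinary row, so it is reordered and written like the data rows.
        writer.writerows(reader)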


if __name__ == '__main__':
    main()
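
If you would rather not hard-code the input column names, a variation is to read the header from the file itself and reorder by position. The following is only a sketch under that assumption: `reorder_tsv` is a hypothetical helper name, the sample data is made up, and unlike the code above it does loop over the rows explicitly.

```python
import csv
import io


def reorder_tsv(inp, outp, outfields):
    """Copy TSV rows from inp to outp, keeping only outfields, in that order."""
    reader = csv.reader(inp, delimiter='\t')
    # lineterminator='\n' is chosen for this demo; the csv default is '\r\n'.
    writer = csv.writer(outp, delimiter='\t', lineterminator='\n')

    header = next(reader)  # the first row holds the column names
    # Position of each requested column; raises ValueError if a name is missing.
    indices = [header.index(name) for name in outfields]

    writer.writerow(outfields)
    for row in reader:
        writer.writerow([row[i] for i in indices])


# Small in-memory demonstration; real use would pass open file objects.
sample = "A\tB\tC\tD\na1\tb1\tc1\td1\na2\tb2\tc2\td2\n"
result = io.StringIO()
reorder_tsv(io.StringIO(sample), result, ['D', 'A', 'B'])
print(result.getvalue())
```

With real files, open both with `newline=''` as in the code above and pass the file objects in place of the `io.StringIO` buffers.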

Answer 2

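You can do this easily with pandas: read the file into a pandas DataFrame, change the column order of the DataFrame as needed, and then write it back out to a TSV file.

To read the file into a pandas DataFrame, use: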

import pandas as pd    
df = pd.read_csv("abc.tsv", sep='\t', header=0)

You can learn the basics of pandas here.
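
The remaining two steps, reordering the columns and writing the result, might look like the sketch below. The sample data and the `new_order` list are made up for illustration, and an in-memory buffer stands in for the real files.

```python
import io

import pandas as pd

# In-memory stand-in for abc.tsv; a real script would pass the file path.
sample = "A\tB\tC\tD\na1\tb1\tc1\td1\na2\tb2\tc2\td2\n"
df = pd.read_csv(io.StringIO(sample), sep='\t', header=0)

# Selecting with a list of names returns the columns in that order.
new_order = ['D', 'A', 'C', 'B']  # hypothetical target order
df = df[new_order]

# index=False keeps pandas from writing the row index as an extra column.
out = io.StringIO()
df.to_csv(out, sep='\t', index=False)
print(out.getvalue())
```

Note that this loads the whole file into memory. For very large inputs, `read_csv` also accepts a `chunksize` argument so the file can be processed piece by piece.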

Answer 3

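Something like this:

(I did not change the position of the header.)
I also skipped reading and writing the file, because I assume that part is not a challenge for you.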

original_data = [['A', 'B', 'C'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]


def switch_columns(column_pairs, entries):
    # Swap each pair of column indices in place, skipping the header row.
    for first, second in column_pairs:
        for entry in entries[1:]:
            entry[first], entry[second] = entry[second], entry[first]


print('Before:')
print(original_data)
switch_columns([(0, 2)], original_data)
print('After:')
print(original_data)

Output:

Before:
[['A', 'B', 'C'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
After:
[['A', 'B', 'C'], ['c1', 'b1', 'a1'], ['c2', 'b2', 'a2']]
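
If the header should move together with the data, and the target order is an arbitrary permutation rather than a series of swaps, one possible sketch is to rebuild each row from a list of column indices. `reorder_columns` is a hypothetical name used only for this illustration; it returns new rows instead of modifying the input in place.

```python
def reorder_columns(order, entries):
    """Return new rows whose columns follow order (a list of column indices)."""
    return [[row[i] for i in order] for row in entries]


original_data = [['A', 'B', 'C'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
# Put column 2 first, then column 0, then column 1; the header row is included.
reordered = reorder_columns([2, 0, 1], original_data)
print(reordered)
```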

Parsing a .TSV file and writing the data to a new .TSV file with the columns rearranged
