Hi, I have two files, shown below, and I want to build a third file from them. I can already do this, but my code is really slow.
file1
1 469 - CG 17 19
1 471 - CG 19 19
1 483 + CG 1 1
1 484 - CG 20 23
1 488 + CG 2 2
file2
1 468 + CG 5 6
1 469 - CG 25 31
1 470 + CG 4 6
1 471 - CG 22 31
1 483 + CG 10 10
1 484 - CG 36 43
file3
1 468 0 0 5 6
1 469 17 19 25 31
1 470 0 0 4 6
1 471 19 19 22 31
1 483 1 1 10 10
1 484 20 23 36 43
1 488 2 2 12 12
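I am looking for a faster way to do this in Python, because the files are very large.
Answer 0 (score: 1)
pandas is a general-purpose toolbox for dataset manipulation, and it includes high-performance join operations. Here is what the file merge could look like: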
import pandas as pd

def split_45(df):
    """
    Given a DataFrame, split column 4, which will contain
    an oddball tab-separated set of values from an otherwise
    fixed-width, space-separated dataset, into proper columns
    4 and 5.
    """
    tabcol = df[4].str.split("\t")
    df[4] = tabcol.apply(lambda x: x[0])
    df[5] = tabcol.apply(lambda x: x[1])

# read in datasets
d1 = pd.read_fwf("file1.txt", header=None)
d2 = pd.read_fwf("file2.txt", header=None)

# clean up the funky column 4 into 4 and 5
split_45(d1)
split_45(d2)

# delete undesired columns
del d1[2]
del d1[3]
del d2[2]
del d2[3]

# merge datasets, on the key field, unioning the keys (outer join),
# and sorting the results; file1 is the left frame so that its
# values come first in each row, as in the output below
d3 = pd.merge(left=d1, right=d2, on=[1], how='outer', sort=True)

# drop an unneeded column and fill the NaNs with 0
del d3['0_y']
d3.fillna(0, inplace=True)

# write fixed width text data to file
with open("file3.txt", "w") as f:
    f.write(d3.to_string(header=False, index=False))
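This code is longer than it would otherwise need to be, because your data does not appear to be in a pure fixed-width format: it contains tabs separating the final columns. The split_45 function cleans those up and splits the values into separate columns.
At the end of the run, file3.txt will contain: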
0 468 0 0 5 6
1 469 17 19 25 31
0 470 0 0 4 6
1 471 19 19 22 31
1 483 1 1 10 10
1 484 20 23 36 43
1 488 2 2 0 0
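Note that this differs slightly from the desired output in the last row. On the other hand, the input shown above does not contain the values 12 12 for key 488, so 0 0 is the correct result given that input.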
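As a possible simplification of the approach above, here is a minimal sketch that assumes the columns are separated by any mix of spaces and tabs, so that pd.read_csv with a whitespace regex separator can parse them directly and split_45 is not needed. The column names in COLS are hypothetical labels chosen for illustration, the sample rows are the ones from the question, and joining on both of the first two columns is an assumption that keeps the leading column from being filled with 0.

```python
import io
import pandas as pd

# Sample rows from the question; the last columns are tab-separated here
# to mimic the mixed space/tab layout described above.
FILE1 = (
    "1 469 - CG\t17\t19\n"
    "1 471 - CG\t19\t19\n"
    "1 483 + CG\t1\t1\n"
    "1 484 - CG\t20\t23\n"
    "1 488 + CG\t2\t2\n"
)
FILE2 = (
    "1 468 + CG\t5\t6\n"
    "1 469 - CG\t25\t31\n"
    "1 470 + CG\t4\t6\n"
    "1 471 - CG\t22\t31\n"
    "1 483 + CG\t10\t10\n"
    "1 484 - CG\t36\t43\n"
)

# Hypothetical column labels; the files themselves have no header.
COLS = ["chrom", "pos", "strand", "ctx", "a", "b"]

def load(source):
    # sep=r"\s+" splits on any run of whitespace, tabs included.
    return pd.read_csv(source, sep=r"\s+", header=None, names=COLS,
                       usecols=["chrom", "pos", "a", "b"])

def merge_counts(d1, d2):
    # Outer join keeps every position found in either file.
    d3 = pd.merge(d1, d2, on=["chrom", "pos"], how="outer", sort=True,
                  suffixes=("_1", "_2"))
    # Positions missing from one file get 0 0 for that file.
    return d3.fillna(0).astype(int)

d3 = merge_counts(load(io.StringIO(FILE1)), load(io.StringIO(FILE2)))
print(d3.to_string(header=False, index=False))
```

For real data, the io.StringIO objects would be replaced by the paths to file1.txt and file2.txt. With the sample rows, key 488 still ends in 0 0, since file2 has no row for it.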
Answer 1 (score: 0)
Try this:
def read_counts(path):
    # map the position (column 2) to the two count columns (columns 5 and 6)
    with open(path) as f:
        return {i[1]: [i[4], i[5]] for i in (line.split() for line in f) if i}

file1Dict = read_counts('file1.txt')
file2Dict = read_counts('file2.txt')

# sort the keys numerically; sorting them as strings would put '1000' before '468'
for key in sorted(set(file1Dict) | set(file2Dict), key=int):
    file1Value = file1Dict.get(key, ['0', '0'])
    file2Value = file2Dict.get(key, ['0', '0'])
    print(' '.join(['1', key] + file1Value + file2Value))
Output:
1 468 0 0 5 6
1 469 17 19 25 31
1 470 0 0 4 6
1 471 19 19 22 31
1 483 1 1 10 10
1 484 20 23 36 43
1 488 2 2 12 12
1 489 25 25 43 46
1 492 2 2 11 13
1 493 22 27 41 47
1 496 4 4 17 18
1 497 26 30 41 44
1 524 5 6 21 21
1 525 25 27 33 33
1 541 9 11 31 31
1 542 24 26 0 0
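Since the files are described as very large, a streaming merge may be worth considering as an alternative to the dictionary approach. The sketch below is not part of either answer. It assumes that both files are already sorted by ascending position (column 2) and that every row belongs to the same chromosome (column 1), as in the sample rows. The function names are hypothetical. Only one row from each file is held in memory at a time.

```python
import io

def rows(handle):
    # Yield (chrom, pos, counts) for each non-empty whitespace-separated row.
    for line in handle:
        fields = line.split()
        if fields:
            yield fields[0], int(fields[1]), fields[4:6]

def merge_sorted(handle1, handle2):
    """Merge two inputs that are ASSUMED to be sorted by position."""
    it1, it2 = rows(handle1), rows(handle2)
    r1, r2 = next(it1, None), next(it2, None)
    zero = ["0", "0"]
    while r1 is not None or r2 is not None:
        if r2 is None or (r1 is not None and r1[1] < r2[1]):
            # Position only present in the first input.
            yield " ".join([r1[0], str(r1[1])] + r1[2] + zero)
            r1 = next(it1, None)
        elif r1 is None or r2[1] < r1[1]:
            # Position only present in the second input.
            yield " ".join([r2[0], str(r2[1])] + zero + r2[2])
            r2 = next(it2, None)
        else:
            # Same position in both inputs.
            yield " ".join([r1[0], str(r1[1])] + r1[2] + r2[2])
            r1, r2 = next(it1, None), next(it2, None)

# Sample rows from the question.
F1 = "1 469 - CG 17 19\n1 471 - CG 19 19\n1 483 + CG 1 1\n1 484 - CG 20 23\n1 488 + CG 2 2\n"
F2 = ("1 468 + CG 5 6\n1 469 - CG 25 31\n1 470 + CG 4 6\n"
      "1 471 - CG 22 31\n1 483 + CG 10 10\n1 484 - CG 36 43\n")

merged = list(merge_sorted(io.StringIO(F1), io.StringIO(F2)))
for line in merged:
    print(line)
```

With real files, the two io.StringIO objects would be replaced by open file handles, and each merged line could be written to file3.txt as it is produced. If the inputs are not sorted, this sketch gives wrong results and the dictionary version above is the safer choice.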