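Problem in Python: merging 2 text files and summarizing them into a new text file

Time: 2018-11-08 10:50:45

Tags: python

I have 2 text files like the examples below. I named one of them first (comma separated) and the other one second (tab separated).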

first

chr1,105000000,105310000,2,1,3,2
chr1,5310000,5960000,2,1,5,4
chr1,1580000,1180000,4,1,5,3
chr19,107180000,107680000,1,1,5,4
chr1,7680000,8300000,3,1,1,2
chr1,109220000,110070000,4,2,3,3
chr1,11060000,12070000,6,2,7,4

second

AKAP8L  chr19   107180100   107650000   transcript
AKAP8L  chr19   15514130    15529799    transcript
AKIRIN2 chr6    88384790    88411927    transcript
AKIRIN2 chr6    88410228    88411243    transcript
AKT3    chr1    105002000   105010000   transcript
AKT3    chr1    243663021   244006886   transcript
AKT3    chr1    243665065   244013430   transcript
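In file first, columns 2 and 3 are the start and end. In file second, columns 3 and 4 are the start and end, respectively. I want to create a new text file from first and second. In the new file, I want to count, for each row of file first, the number of rows in file second that match it based on the following 3 conditions (on 3 columns):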

1- the 1st column in file first is equal to the 2nd column in file second.
2- the 3rd column in file second is greater than the 2nd column in file first and smaller than the 3rd column in file first.
3- the 4th column in file second is also greater than the 2nd column in file first and smaller than the 3rd column in file first.
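
As a rough sketch, the three conditions above can be written as a single predicate. The function name `matches` and the conversion to `int` are assumptions made for illustration; the positions are compared as numbers because comparing them as strings would order them lexicographically.

```python
def matches(first_cols, second_cols):
    """Return True if one row of second satisfies all 3 conditions
    for one row of first.

    first_cols:  fields of one comma-separated line of first
    second_cols: fields of one whitespace-separated line of second
    """
    start, end = int(first_cols[1]), int(first_cols[2])
    return (
        first_cols[0] == second_cols[1]         # condition 1
        and start < int(second_cols[2]) < end   # condition 2
        and start < int(second_cols[3]) < end   # condition 3
    )


# One row from each sample file shown above.
first_row = "chr1,105000000,105310000,2,1,3,2".split(",")
second_row = "AKT3    chr1    105002000   105010000   transcript".split()
print(matches(first_row, second_row))  # True
```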

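In fact, the output should look like the expected output below. The first 7 columns come directly from file first, and the 9th column is the number of rows in file second that match each row of file first (based on the 3 criteria above). The 8th column is the first column of the first row in file second that matches that particular row of file first.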

expected output

chr19,107180000,107680000,1,1,5,4,AKAP8L, 1
chr1,105000000,105310000,2,1,3,2, AKT3, 1

I tried to do this in Python and wrote the following code, but it does not return what I want.

first = open('first.csv', 'rb')
second = open('second.txt', 'rb')
first_file = []
for line in first:
    first_file.append(line.split(','))

second_file = []
for line2 in second:
    second_file.append(line.split())

count=0
final = []
for i in range(len(first_file)):
    for j in range(len(second_file)):
        first_row = first_file[i]
        second_row = second_file[j]
        first_col = first_row.split()
        second_col = second_row.split()
        if first_col[0] == second_col[1] and first_col[1] < second_col[2] < first_col[2] and first_col[1] < second_col[3] < first_col[2]
            count+=1
            final.append(first_col[i]+second_col[0]+count)

2 Answers:

Answer 0 (score: 2)

Since you have no column names, this does not look very robust, but it works and uses pandas:

import pandas as pd

first = 'first.csv'
second = 'second.txt'

df1 = pd.read_csv(first, header=None)
df2 = pd.read_csv(second, sep=r'\s+', header=None)

merged = df1.merge(df2, left_on=[0], right_on=[1], suffixes=('first', 'second'))
a, b, c, d = merged['2second'], merged['1first'], merged['2first'], merged['3second']

cleaned = merged[(c>a)&(a>b)&(c>d)&(d>b)]

counted = cleaned.groupby(['0first', '1first', '2first', '3first', '4first', 5, 6, '0second'])['4second'].count().reset_index()

counted.to_csv('result.csv', index=False, header=False)

This produces a result.csv with the following contents:

chr1,105000000,105310000,2,1,3,2,AKT3,1
chr19,107180000,107680000,1,1,5,4,AKAP8L,1

Answer 1 (score: 0)

With the same setup, the following will work and produce the result you want.

# Read both files; "with" closes them automatically.
with open('first.csv', 'r') as first:
    first_file = [line.strip() for line in first if line.strip()]
with open('second.txt', 'r') as second:
    second_file = [line.split() for line in second if line.strip()]

final = []
for first_row in first_file:
    first_col = first_row.split(',')
    # Compare the positions as integers, not as strings.
    start, end = int(first_col[1]), int(first_col[2])
    count = 0    # reset for every row of first
    name = None  # first column of the first matching row in second
    for second_col in second_file:
        if (first_col[0] == second_col[1]
                and start < int(second_col[2]) < end
                and start < int(second_col[3]) < end):
            count += 1
            if name is None:
                name = second_col[0]
    # Keep only the rows of first that matched at least once.
    if count:
        final.append(first_row + ',' + name + ',' + str(count))
print(final)
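
The question asks for a new text file, while the code above only prints the list. Below is a minimal sketch of that last step, assuming the rows are already joined strings like the entries of `final`. The helper name `write_rows` and the output name `result.txt` are hypothetical, and the temporary directory is only there to keep the demo self-contained.

```python
import os
import tempfile


def write_rows(rows, path):
    """Write each result row on its own line of a text file."""
    with open(path, "w") as out:
        for row in rows:
            out.write(row + "\n")


# Example rows in the same shape as the entries of `final`.
rows = [
    "chr1,105000000,105310000,2,1,3,2,AKT3,1",
    "chr19,107180000,107680000,1,1,5,4,AKAP8L,1",
]

# "result.txt" is a made-up name used only for this demo.
out_path = os.path.join(tempfile.mkdtemp(), "result.txt")
write_rows(rows, out_path)
```

In the script above, calling `write_rows(final, 'result.txt')` after the loop would save the rows instead of printing them.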