Summarizing two text files into one file in Python

Time: 2018-11-13 21:42:32

Tags: python

I have two files, big and small, like the following examples:

big

chr1    transcript      2481359 2483515 -       RP3-395M20.8
chr1    transcript      2487078 2492123 +       TNFRSF14
chr1    transcript      2497849 2501297 +       RP3-395M20.7
chr1    transcript      2512999 2515942 +       RP3-395M20.9
chr1    transcript      2517930 2521041 +       FAM213B
chr1    transcript      2522078 2524087 -       MMEL1

small

chr1    2487088 2492113 17
chr1    100757323       100757324       19
chr1    2487099 2492023 21
chr1    100758316       100758317       41
chr1    2514000 2515742 14

I am trying to make a new file with 5 columns from the big file, keeping only the rows that meet the following conditions:

conditions

1- if: the 1st column of small file == 1st column of big file
2- if: the 4th column of big file >= the 2nd column of small file >= the 3rd column of big file
3- if: the 4th column of big file >= the 3rd column of small file >= the 3rd column of big file

columns in output file

1) 1st column of big file
2) 2nd column of big file
3) 3rd column of big file
4) the number of lines in small files that have the mentioned conditions (we should count)
5) 6th column of big file

Here is the expected output for the example above:

chr1    2487078 2492123 2       TNFRSF14
chr1    2512999 2515942 1       RP3-395M20.9

I wrote the following code in Python. It does not return the file I want, even though every line of my code seems logical. Can you help me fix it?

def correspond(big, small, outfile):
    count = 0
    big = open(big, "r")
    small = open(small, "r")
    big_list = []
    small_list = []
    for m in big:
        big_list.append(m)
    for n in small:
        small_list.append(n)
    final = []
    for i in range(0, len(small_list)):
        for j in range(0, len(big_list)):
            small_row = small_list[i]
            big_row = big_list[j]
            small_columns = small_row.split()
            big_columns = big_row.split()
            small_symbol = small_columns[0]
            big_symbol = big_columns[0]
            name = big_columns[5]
            if small_symbol == big_symbol:
                small_second_col = small_columns[1]
                small_third_col = small_columns[2]
                min_range = big_columns[2]
                max_range = big_columns[3]
                if (small_second_col <= max_range and small_second_col >= min_range and small_third_col <= max_range and small_third_col >= min_range):
                        count+=1
                        new_line = small_row.rstrip("\n") + " " + big_symbol + " " + min_range + " " + max_range + str(count) + name
                        final.append(new_line)
    with open(outfile, "w") as f:
        for item in final:
            f.write("%s\n" % item)

2 answers:

Answer 0 (score: 1)

A complete solution, without pandas:

from itertools import product


def str_or_int(item):
    try:
        return int(item)
    except ValueError:
        return item

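def correspond(big, small, output):
    with open(big, 'r') as bigf, open(small, 'r') as smallf, open(output, 'w') as outputf:
        current = None
        count = 0

        def flush():
            # Write columns 1, 3 and 4 of the current big line, the count, and column 6,
            # which is the layout of the expected output in the question.
            if current is not None and count > 0:
                out_line = current.split()
                outputf.write('\t'.join((out_line[0], out_line[2], out_line[3], str(count), out_line[5])) + '\n')

        # product() reads both files fully, then yields every (big line, small line) pair,
        # grouped by big line.
        for b_line, s_line in product(filter(lambda x: x != '\n', bigf), filter(lambda x: x != '\n', smallf)):
            if b_line != current:
                flush()
                current = b_line
                count = 0
            b_line = [str_or_int(s) for s in b_line.split()]
            s_line = [str_or_int(s) for s in s_line.split()]
            try:
                if b_line[0] == s_line[0] and b_line[3] >= s_line[1] >= b_line[2] and b_line[3] >= s_line[2] >= b_line[2]:
                    count += 1
            except (IndexError, TypeError):
                # Skip rows that are too short or whose coordinates are not numeric.
                continue
        flush()  # the last big line is never followed by a different one, so write it here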
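
The str_or_int helper above matters because fields read from a text file are strings, and Python compares strings character by character rather than by numeric value. The two orderings can disagree when the numbers have different digit counts. A minimal sketch of the difference; the value "249" is made up for illustration and does not appear in the sample files:

```python
# Range taken from the TNFRSF14 row of big; the candidate value is hypothetical.
low, high = "2487078", "2492123"
value = "249"

# Lexicographic comparison: "249" sorts between the two range strings.
as_strings = low <= value <= high

# Numeric comparison: 249 is far below 2487078.
as_ints = int(low) <= int(value) <= int(high)

print(as_strings, as_ints)  # True False
```

Converting the coordinate columns to int before comparing avoids this class of mismatch.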

If you have any questions, leave a comment.

Answer 1 (score: 0)

Given the following sample input:

big = '''chr1    transcript      2481359 2483515 -       RP3-395M20.8
chr1    transcript      2487078 2492123 +       TNFRSF14
chr1    transcript      2497849 2501297 +       RP3-395M20.7
chr1    transcript      2512999 2515942 +       RP3-395M20.9
chr1    transcript      2517930 2521041 +       FAM213B
chr1    transcript      2522078 2524087 -       MMEL1'''

small = '''chr1    2487088 2492113 17
chr1    100757323       100757324       19
chr1    2487099 2492023 21
chr1    100758316       100758317       41
chr1    2514000 2515742 14'''

big, small = ([l.split() for l in d.splitlines()] for d in (big, small))

You can use sum with a generator expression to count the rows in small that meet the conditions, then use str.join to produce the desired output:

for name_big, _, low, high, _, note in big:
    count = sum(1 for name_small, n1, n2, _ in small if name_big == name_small and all(int(low) <= int(n) <= int(high) for n in (n1, n2)))
    if count:
        print('\t'.join((name_big, low, high, str(count), note)))

This outputs:

chr1    2487078 2492123 2   TNFRSF14
chr1    2512999 2515942 1   RP3-395M20.9
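
Since the question reads from and writes to files, the same counting idea can be wrapped in a function. This is only a sketch, assuming whitespace-separated files with exactly six columns per row in big and four in small; the name summarize and its parameters are hypothetical and not part of the answer above.

```python
def summarize(big_path, small_path, out_path):
    """Hypothetical helper: for each big row, count the small rows that fall inside it."""
    # Load small once and skip blank lines (assumes exactly 4 columns per row).
    with open(small_path) as f:
        small = [line.split() for line in f if line.strip()]

    with open(big_path) as big_file, open(out_path, "w") as out:
        for line in big_file:
            if not line.strip():
                continue
            # Assumes exactly 6 columns per row in big.
            chrom, _, low, high, _, name = line.split()
            lo, hi = int(low), int(high)
            # A small row counts when the chromosome matches and both of its
            # coordinates lie within [lo, hi].
            count = sum(
                1
                for s_chrom, start, end, _ in small
                if s_chrom == chrom and all(lo <= int(n) <= hi for n in (start, end))
            )
            if count:
                out.write("\t".join((chrom, low, high, str(count), name)) + "\n")
```

It could be called as summarize("big", "small", "out.txt"), where the three file names are placeholders. It scans all of small for every row of big, which is fine for small inputs but may be slow for large ones.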