Question

I have 2 tab-separated text files. One is called major and the other is called minor. Here are 2 small examples of the files:

major:

chr1    +   1071396 1271396 LOC
chr12   +   1101483 1121483 MIR200B

minor:

chr1    1071496 1071536 1
chr1    1071536 1071566 0
chr1    1073566 1073366 1
chr12   1101487 1101516 0
chr12   1101625 1101671 1

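I want to build a new file from these 2 files. To get the final file I have to go through the following steps:

step1: divide the difference between columns 3 and 4 of the major file by 100. In this step I create a new file from the major file, with 100 times as many rows as the major file. Two changes are made in this new file: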

1st: columns 3 and 4 will be changed
2nd: I will add a new column called part (in this file that would be part 1 to part 100 per row in major file)



(1271396 − 1071396) ÷ 100 = 2000 ----> this would be the new difference between columns 3 and 4

chr1    +   1071396 1073396 LOC LOC_part1
chr1    +   1073396 1075396 LOC LOC_part2
.
.
.
chr1    +   1269396 1271396 LOC LOC_part100
chr12   +   1101483 1101683 MIR200B MIR200B_part1
chr12   +   1101683 1101883 MIR200B MIR200B_part2
.
.
.
chr12   +   1121283 1121483 MIR200B MIR200B_part100
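Step 1 can be sketched as a small function. This is only an illustration: the name split_into_parts and the n_parts parameter are made up here, and the sketch assumes the span between columns 3 and 4 divides evenly by 100, as it does in both sample rows.

```python
def split_into_parts(row, n_parts=100):
    """Split one major row (chrom, strand, start, end, name) into n_parts rows.

    Hypothetical helper; assumes (end - start) is divisible by n_parts.
    """
    chrom, strand, start, end, name = row
    start, end = int(start), int(end)
    step = (end - start) // n_parts  # 2000 for the chr1 sample row
    parts = []
    for k in range(n_parts):
        low = start + k * step
        high = low + step
        # Sixth column is the new "part" label, numbered from 1.
        parts.append((chrom, strand, low, high, name, f"{name}_part{k + 1}"))
    return parts


rows = split_into_parts(("chr1", "+", "1071396", "1271396", "LOC"))
print(rows[0])   # ('chr1', '+', 1071396, 1073396, 'LOC', 'LOC_part1')
print(rows[-1])  # ('chr1', '+', 1269396, 1271396, 'LOC', 'LOC_part100')
```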

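From now on, this new file serves as the major file for the next steps. I call it new_major.

step2: for each row of new_major, count the rows of the minor file that match it under the following conditions: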

A) column 1 in minor file == column 1 in new_major
and
B) (column3 of new_major) <= (column2 of minor file) <= (column4 of new_major)
and
C) (column3 of new_major) <= (column3 of minor file) <= (column4 of new_major)
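Conditions A, B and C translate fairly directly into one counting function. The sketch below is an assumption about how one might write it, not the asker's code: count_matches is a hypothetical name, the new_major row is assumed to already hold integer bounds, and the minor rows are the five sample rows from above.

```python
def count_matches(part_row, minor_rows):
    """Count minor rows on the same chromosome that lie fully inside part_row.

    Hypothetical helper: part_row is a new_major row whose columns 3 and 4
    are already integers; minor_rows hold strings as read from the file.
    """
    chrom, _strand, low, high = part_row[:4]
    return sum(
        1
        for m_chrom, m_start, m_end, _value in minor_rows
        if m_chrom == chrom                 # condition A
        and low <= int(m_start) <= high     # condition B
        and low <= int(m_end) <= high       # condition C
    )


minor_rows = [
    ("chr1", "1071496", "1071536", "1"),
    ("chr1", "1071536", "1071566", "0"),
    ("chr1", "1073566", "1073366", "1"),
    ("chr12", "1101487", "1101516", "0"),
    ("chr12", "1101625", "1101671", "1"),
]
part1 = ("chr1", "+", 1071396, 1073396, "LOC", "LOC_part1")
print(count_matches(part1, minor_rows))  # 2
```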

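step3: make the final tab-separated file with 7 columns. The first 6 columns are the same as in the new_major file, and the 7th column is the count from step 2.

The expected output would look like this: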

chr1    +   1071396 1073396 LOC LOC_part1   2
chr1    +   1073396 1075396 LOC LOC_part2   1
.
.
.
chr1    +   1269396 1271396 LOC LOC_part100 0
chr12   +   1101483 1101683 MIR200B MIR200B_part1   2
chr12   +   1101683 1101883 MIR200B MIR200B_part2   0
.
.
.
chr12   +   1121283 1121483 MIR200B MIR200B_part100 0

I wrote the following code to get the expected output, but it gives an error. The error is shown after the code.

major = open('major.txt', 'rb')
minor = open('minor.txt', 'rb')

minor = []
for line in minor:
    minor.append(line)

major = []
for line in major:
    major.append(line)


new_major = []
for i in major:
    percent = (i[3]-i[2])/100
    for j in percent:
        new_major.append(i[0], i[1], i[2], i[2]+percent, i[4], i[4]_'part'percent[j])


new_major, minor = ([l.split() for l in d.splitlines()] for d in (new_major, minor))

for name_major, sign, low, high, note in major:
    parts = list(range(int(low), int(high) + 1, (int(high) - int(low)) // 100))
    for part, (low, high) in enumerate(zip(parts, parts[1:]), 1):
        count = sum(1 for name_minor, n1, n2, _ in minor if name_major == name_minor and all(low <= int(n) <= high for n in (n1, n2)))
        print('\t'.join((name_major, sign, str(low), str(high), note, '%s_part%d' % (note, part), str(count))))

This is the error I get:

>>> for name_major, sign, low, high, note in major:
...     parts = list(range(int(low), int(high) + 1, (int(high) - int(low)) // 100))
...     for part, (low, high) in enumerate(zip(parts, parts[1:]), 1):
...         count = sum(1 for name_minor, n1, n2, _ in minor if name_major == name_minor and all(low <= int(n) <= high for n in (n1, n2)))
...         gg = ('\t'.join((name_major, sign, str(low), str(high), note, '%s_part%d' % (note, part), str(count))))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack

Do you know how to fix the problem?

Answer 1

I think you want to unpack new_major rather than major, which is just the file reader opened at the beginning of your Python file.

for name_major, sign, low, high, note in new_major:
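For context, here is a minimal sketch of why unpacking can fail this way. It assumes each item of major is still a raw line rather than a list of 5 fields, which is a guess based on the code in the question. Unpacking a string walks over its characters, so any line longer than 5 characters raises the ValueError.

```python
# One raw line, as iterating over an open text file would yield it (assumed).
line = "chr1\t+\t1071396\t1271396\tLOC\n"

try:
    # Unpacking a string iterates over its characters, not its columns.
    name, sign, low, high, note = line
except ValueError as exc:
    error = str(exc)
    print(error)

# Splitting the line first gives exactly 5 fields, so unpacking works.
name, sign, low, high, note = line.split()
print(name, sign, low, high, note)
```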

Also make sure to close the files with file_object.close() to free the resources.
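Putting both points together, one possible end-to-end sketch follows. It is not the asker's code: parse, summarize and main are hypothetical names, final.txt is a made-up output path, and the sketch assumes every major row has 5 columns, every minor row has 4, and each span divides evenly by 100. The with blocks close the files automatically, so no explicit close() call is needed.

```python
def parse(lines):
    """Split each non-empty line into whitespace-separated fields."""
    return [line.split() for line in lines if line.strip()]


def summarize(major_lines, minor_lines, n_parts=100):
    """Return the 7-column output rows as tab-joined strings."""
    minor = [(c, int(a), int(b)) for c, a, b, _ in parse(minor_lines)]
    out = []
    for chrom, strand, start, end, name in parse(major_lines):
        start, end = int(start), int(end)
        step = (end - start) // n_parts  # assumes an even split
        for k in range(n_parts):
            low, high = start + k * step, start + (k + 1) * step
            # Count minor rows on this chromosome lying fully inside the part.
            count = sum(
                1 for c, a, b in minor
                if c == chrom and low <= a <= high and low <= b <= high
            )
            out.append("\t".join(
                [chrom, strand, str(low), str(high), name,
                 f"{name}_part{k + 1}", str(count)]
            ))
    return out


def main(major_path="major.txt", minor_path="minor.txt", out_path="final.txt"):
    # "with" closes each file even if an error occurs part-way through.
    with open(major_path) as major_file, open(minor_path) as minor_file:
        rows = summarize(major_file, minor_file)
    with open(out_path, "w") as out_file:
        out_file.write("\n".join(rows) + "\n")


# Small in-memory demo using the first sample rows from the question.
demo = summarize(
    ["chr1\t+\t1071396\t1271396\tLOC\n"],
    ["chr1\t1071496\t1071536\t1\n", "chr1\t1071536\t1071566\t0\n"],
)
print(demo[0])
```

Calling main() with the real file paths would write the final file. Note that summarize rescans every minor row for each part, which may be slow for large files.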
