我是编程新手,但要解决以下问题。我希望找到两个文件的交集。
例如,文件1:
> scaffold1 0 206 transcript_loc.00001 exon
> scaffold1 262 749 transcript_loc.00001 exon
> scaffold1 1391 1549 transcript_loc.00001 exon
FILE2:
scaffold1 517 540 Simple_repeat
scaffold1 1063 1162 LTR/Gypsy
scaffold1 1400 1498 LTR
我想调用两个文件坐标相交的位置,然后将相交附加到File1。
结果文件:
scaffold1 0 206 transcript_loc.00001 exon
scaffold1 262 749 transcript_loc.00001 exon 517 540 Simple_repeat
scaffold1 1391 1549 transcript_loc.00001 exon 1400 1498 LTR
任何关于从哪里开始的建议都是最受欢迎的。我可以使用BedTools Intersect找到交叉点,但我真的想知道如何使用python执行此操作。
谢谢。
答案 0 :(得分:1)
如果两个文件都按照示例中的顺序进行排序,那么您可以在从一个或两个中智能地逐行读取时找到交叉点。这对于进程的内存消耗更友好,并且不需要为file1中的每一行循环file2。这对于像你的例子这样的小文件无关紧要,但是当这些文件很大时会帮助你。
请注意,我只考虑了启动和停止列以匹配DNA链。如果还需要匹配其他字段,则此代码将无效(但您可以对其进行调整)。
在现实生活中使用它之前,需要进行更多检查(例如,源文件中的行)。
代码:
def read_and_parse_line(fh):
# It gets the next line,
# strip any whitespace from the end,
# split it on whitespace into fields.
return fh.next().rstrip().split()
def append_to_result(result, line):
(scaffold1, start1, end1, transcript1, type1, start2, end2, comment2) = line
# Making start and end real integers so you can work with them as such.
result.append((scaffold1, int(start1), int(end1), transcript1, type1, int(start2), int(end2), comment2))
# These will be our result lines.
result = []
# With makes sure the files are properly closed afterwards.
with open('file1') as file1, open('file2') as file2:
# Getting the iterator on the files.
file1 = iter(file1)
file2 = iter(file2)
try:
(scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
read_second = True
while True:
if read_second:
try:
(scaffold2, start2, end2, comment2) = read_and_parse_line(file2)
except StopIteration:
print("File 2 ended.")
# Nothing more in the second file.
# Outputting the original line
result.append((scaffold1, start1, end1, transcript1, type1, 0, 0, None))
# Reading in a new line from the first file.
(scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
if int(start1) <= int(start2):
# Second file record starts after the first file record.
if int(end1) >= int(end2):
# And the second file record ends before the first file record.
# = MATCH!!!
print("We've found a match: [%s-[%s-%s]-%s]" % (start1, start2, end2, end1))
# Write the combined line.
result.append((scaffold1, start1, end1, transcript1, type1, start2, end2, comment2))
## READ BOTH
(scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
read_second = True
else:
# The second file record ends AFTER the first file record.
print("No match in second file: [%s-%s]" % (start1, end1))
# So writing out the unmodified line.
result.append((scaffold1, start1, end1, transcript1, type1, 0, 0, None))
## READ 1
(scaffold1, start1, end1, transcript1, type1) = read_and_parse_line(file1)
read_second = False
else:
# This is thrown away in your example.
print("No match in the first file for: [%s-%s]" % (start2, end2))
# Trying the next line in the second file.
read_second = True
except StopIteration:
print("File 1 ended.")
# Printing a clean line after all the logging above.
print("")
# Printing the result to screen.
# I guess you want to do more stuff with the data and not just write it to file.
for line in result:
# The 8th element being equal to None indicates there was no match.
if line[7] == None:
print("%s\t%s\t%s\t%s\t%s" % line[:5])
else:
print("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % line)
答案 1 :(得分:0)
因为您没有指定任何其他约束(例如,需要就地进行交叉,在效率或最佳速度方面的任何要求,存在一个漂亮的算法&#39;等等。),你可以(而且应该!)使用最愚蠢的策略:
file1
中的第一行,这将是(0,206)。在转到下一行之前,请检查file2
的所有内容,查看在您想要的时间间隔内具有坐标的第一行。
例如,如果你想要的间隔是(0,206),那么什么都没有
file2
之间的坐标是 - 如果你想要的间隔是
(262,749),然后第二行坐标(517,540)
匹配。
如果file2
中有相应的行,将添加到file1
中的现有行。
返回第2步,直至完成。
path_to_file1 = '<insert path of file1 here>'
path_to_file2 = '<insert path of file2 here>'
path_to_final_output_file = '<insert desired path here>'
# To make the code simpler, I first create a single string that I
# will append each successive line to. At the end, I will just dump
# the whole thing into a new file and call it a day.
final_output_contents = ''
with open(path_to_file1, 'r') as file1:
for line in file1:
# split '> scaffold1 0 206 transcript_ loc.00001 exon '
# into ['>', 'scaffold1', '0', '206', 'transcript_loc.00001', 'exon']
# then take the third and fourth elements respectively to get
# your interval
# `.split()` will break the string up by whitespace, [2:4]
# is Python's slice notation, basically a clever way to select
# elements in an array. This will return 0, 206 for the first
# line in the example.
interval_start, interval_end = line.split()[2:4]
# now read through the entirety of `file2`. Do the same
# as before: split each line, get the coordinates, and this time
# check if they meet the condition that they are in the interval
with open(path_to_file2, 'r') as file2:
for other_line in file2:
first_coordinate, second_coordinate = other_line.split()[1:3]
# `int()` converts its arguments into integers - we need it
# because
if (int(interval_start) < int(first_coordinate) and
int(interval_end) > int(second_coordinate)):
# this is just a fancy way of adding `line` and
# the last three elements of `other_line` together
line += '\t'.join(other_line.split()[-3:])
# break the `for other_line in file2` loop!
break
# add this line to your final file, making sure to add
# a newline character '\n' so your text editor will insert
# an appropriate line break.
final_output_contents += line + '\n'
# now dump the contents to your file
with open(path_to_final_output_file, 'w') as final_output_file:
final_output_file.write(final_output_contents)