Hello, I have a dataset that looks like this:
sample pos mutation
2fec2 40 TC
1f3c 40 TC
19b0 40 TC
tld3 60 CG
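I would like to find a Python way to, for example, find every instance where 2fec2 and 1f3c have the same mutation and print the code. So far I have tried the following, but it just returns everything. I am new to Python and trying to wean myself off awk - please help!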
from sys import argv
script, vcf_file = argv
import vcf
vcf_reader = vcf.Reader(open(vcf_file, 'r'))
for record.affected_start in vcf_reader: #.affected_start is this module's way of calling data from the parsed pos column from a particular type of bioinformatics file
    if record.sample == 2fec2 and 1f3c != 19b0 !=t1d3: #ditto .sample
        print record.affected_start
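Answer 0 (score: 1)
I am assuming your data is in the format you describe, not VCF.
You can try simply parsing the file with standard Python techniques and, for each (pos, mutation) pair, building the set of samples that have it: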
from sys import argv
from collections import defaultdict

# More convenient than a normal dict: an empty set will be
# automatically created whenever a new key is accessed
# keys will be (pos, mutation) pairs
# values will be sets of sample names
mutation_dict = defaultdict(set)

# This "with" syntax ("context manager") is recommended
# because file closing will be handled automatically
with open(argv[1], "r") as data_file:
    # Read first line and check headers
    # (assert <something False>, "message"
    # will make the program exit and display "message")
    # .readline() "consumes" the first line
    # .strip() removes end-of-line character
    # .split() splits into a list of words
    # (by default using "blanks" as separator)
    assert data_file.readline().strip().split() == ["sample", "pos", "mutation"], "Unexpected column names"
    # Now we can loop over the rest of the lines
    # that should contain the data
    for line in data_file:
        # Extract the fields
        [sample, pos, mutation] = line.strip().split()
        # add the sample to the set of samples having
        # this (pos, mutation) combination
        mutation_dict[(pos, mutation)].add(sample)

# Now loop over the key, value pairs in our dict:
for (pos, mutation), samples in mutation_dict.items():
    # True only if both samples are in the set (subset test, <=).
    # A non-empty intersection (&) would also match when just one is present.
    if {"2fec2", "1f3c"} <= samples:
        print("2fec2 and 1f3c share mutation %s at position %s" % (mutation, pos))
With the example data as the first argument to the script, this outputs:
2fec2 and 1f3c share mutation TC at position 40
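For two samples to "share" a mutation, both names have to be in the set collected for that (pos, mutation) pair. A non-empty intersection only shows that at least one of them is there. The following is a minimal sketch of the stricter subset test, using the sample rows from the question held in memory. The helper names `group_samples` and `shared_by_all` are made up for illustration and are not part of any library.

```python
from collections import defaultdict
from io import StringIO

# Sample rows copied from the question, kept in memory for the demo
DATA = """sample pos mutation
2fec2 40 TC
1f3c 40 TC
19b0 40 TC
tld3 60 CG
"""


def group_samples(lines):
    """Map each (pos, mutation) pair to the set of samples carrying it."""
    groups = defaultdict(set)
    next(lines)  # skip the header line
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        sample, pos, mutation = line.split()
        groups[(pos, mutation)].add(sample)
    return groups


def shared_by_all(groups, wanted):
    """Return the (pos, mutation) pairs carried by every sample in `wanted`."""
    wanted = set(wanted)
    # `wanted <= samples` is a subset test: all wanted names must be present
    return [key for key, samples in groups.items() if wanted <= samples]


groups = group_samples(StringIO(DATA))
print(shared_by_all(groups, ["2fec2", "1f3c"]))  # [('40', 'TC')]
print(shared_by_all(groups, ["2fec2", "tld3"]))  # []
```

With an intersection test (`samples & wanted`), the second query would also report matches, because each of the two names appears in some set on its own.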
Answer 1 (score: 0)
How about this:
from sys import argv
import vcf

script, vcf_file = argv
vcf_reader = vcf.Reader(open(vcf_file, 'r'))
# Store our results outside of the loop
fecResult = ""
f3cResult = ""
# For each record
# (.sample and .mutation are the attribute names used in the question)
for record in vcf_reader:
    if record.sample == "2fec2":
        fecResult = record.mutation
    if record.sample == "1f3c":
        f3cResult = record.mutation
# Outside of the loop compare the results and if they match print the record.
if fecResult == f3cResult:
    print(record.affected_start)
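Remembering one mutation per sample and comparing after the loop only looks at the last value seen for each sample, so the comparison probably needs to happen per position. Here is a rough sketch of that idea. It assumes the plain whitespace-separated table from the question rather than a real VCF file, and the function name `matching_positions` is hypothetical.

```python
from io import StringIO

# Sample rows copied from the question, kept in memory for the demo
DATA = """sample pos mutation
2fec2 40 TC
1f3c 40 TC
19b0 40 TC
tld3 60 CG
"""


def matching_positions(lines, first, second):
    """Return the positions where `first` and `second` have the same mutation."""
    seen = {}  # pos -> {sample: mutation}, only for the two samples of interest
    next(lines)  # skip the header line
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        sample, pos, mutation = line.split()
        if sample in (first, second):
            seen.setdefault(pos, {})[sample] = mutation
    # Keep a position only if both samples were seen there with equal mutations
    return [
        pos
        for pos, by_sample in seen.items()
        if len(by_sample) == 2 and by_sample[first] == by_sample[second]
    ]


print(matching_positions(StringIO(DATA), "2fec2", "1f3c"))  # ['40']
print(matching_positions(StringIO(DATA), "2fec2", "tld3"))  # []
```

Positions stay as strings here because they are only compared and printed. Converting them with `int()` would be a reasonable change if they need sorting.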