Classifying data by a column in Python

Time: 2017-01-11 23:01:24

Tags: python bioinformatics

Hello, I have a dataset like the following:

sample    pos    mutation
2fec2     40     TC
1f3c      40     TC
19b0      40     TC
tld3      60     CG

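I would like a Pythonic way to, for example, find every instance where 2fec2 and 1f3c have the same mutation and print the code. So far I have tried the following, but it just returns everything. I am new to Python and trying to wean myself off awk - please help!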

from sys import argv
script, vcf_file = argv
import vcf
vcf_reader = vcf.Reader(open(vcf_file, 'r'))
for record.affected_start in vcf_reader: #.affect_start is this modules way of calling data from the parsed pos column from a particular type of bioinformatics file
    if record.sample == 2fec2 and 1f3c != 19b0 !=t1d3: #ditto .sample
        print record.affected_start

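2 answers:

Answer 0: (score: 1)

I am assuming your data is in the format you describe, not VCF.

You can try simply parsing the file with standard Python techniques and, for each (pos, mutation) pair, building the set of samples that carry it: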

from sys import argv
from collections import defaultdict
# More convenient than a normal dict: an empty set will be
# automatically created whenever a new key is accessed
# keys will be (pos, mutation) pairs
# values will be sets of sample names
mutation_dict = defaultdict(set)
# This "with" syntax ("context manager") is recommended
# because file closing will be handled automatically
with open(argv[1], "r") as data_file:
    # Read first line and check headers
    # (assert <something False>, "message"
    # will make the program exit and display "message")
    assert data_file.readline().strip().split() == ["sample", "pos", "mutation"], "Unexpected column names"
    # .strip() removes end-of-line character
    # .split() splits into a list of words
    # (by default using "blanks" as separator)
    # .readline() has "consumed" a first line.
    # Now we can loop over the rest of the lines
    # that should contain the data
    for line in data_file:
        # Extract the fields
        [sample, pos, mutation] = line.strip().split()
        # add the sample to the set of samples having
        # this (pos, mutation) combination
        mutation_dict[(pos, mutation)].add(sample)
    # Now loop over the key, value pairs in our dict:
    for (pos, mutation), samples in mutation_dict.items():
        # True only if both samples are in the set (subset test, <=);
        # an intersection test (&) would also match when just one is present
        if {"2fec2", "1f3c"} <= samples:
            print("2fec2 and 1f3c share mutation %s at position %s" % (mutation, pos))

With the example data passed as the first argument to the script, the output is:

2fec2 and 1f3c share mutation TC at position 40
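
The same grouping idea can be wrapped in a function so that it works for any pair of samples. This is only a sketch: the function name `shared_mutations` and the in-memory `example` input are hypothetical, and it assumes the whitespace-separated three-column layout shown in the question.

```python
from collections import defaultdict


def shared_mutations(lines, sample_a, sample_b):
    """Return sorted (pos, mutation) pairs carried by both samples.

    Hypothetical helper: `lines` is any iterable of text lines in the
    "sample pos mutation" layout, header line included.
    """
    rows = iter(lines)
    header = next(rows).split()
    if header != ["sample", "pos", "mutation"]:
        raise ValueError("Unexpected column names: %r" % (header,))
    samples_by_mutation = defaultdict(set)
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        sample, pos, mutation = line.split()
        samples_by_mutation[(int(pos), mutation)].add(sample)
    wanted = {sample_a, sample_b}
    # Keep only the keys whose sample set contains both requested samples
    return sorted(key for key, samples in samples_by_mutation.items()
                  if wanted <= samples)


example = """sample    pos    mutation
2fec2     40     TC
1f3c      40     TC
19b0      40     TC
tld3      60     CG
""".splitlines()

for pos, mutation in shared_mutations(example, "2fec2", "1f3c"):
    print("2fec2 and 1f3c share mutation %s at position %d" % (mutation, pos))
```

Returning a list instead of printing inside the loop makes it easier to reuse the result, for instance to compare several pairs of samples in turn.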

Answer 1: (score: 0)

How about this:

from sys import argv
script, vcf_file = argv
import vcf
vcf_reader = vcf.Reader(open(vcf_file, 'r'))

# Store our results outside of the loop
fecResult = ""
f3cResult = ""

# For each record
for record in vcf_reader:
    if record.sample == "2fec2":
        fecResult = record.mutation
    if record.sample == "1f3c":
        f3cResult = record.mutation

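# Outside of the loop, compare the results and print the position if they match.
# The extra truthiness check avoids a false match when neither sample was found.
# Note that record refers to the last record read at this point.
if fecResult and fecResult == f3cResult:
    print(record.affected_start)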
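
Keeping a single result per sample only remembers the last mutation seen for each one. A possible variant stores every (pos, mutation) pair per sample and intersects the two sets. This sketch reads the plain-text table from the question rather than a VCF file, and `mutations_by_sample` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict


def mutations_by_sample(lines):
    """Map each sample name to its set of (pos, mutation) pairs.

    Hypothetical helper; assumes a header line followed by
    whitespace-separated "sample pos mutation" rows.
    """
    result = defaultdict(set)
    rows = iter(lines)
    next(rows)  # skip the header line
    for line in rows:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        sample, pos, mutation = fields
        result[sample].add((int(pos), mutation))
    return result


table = [
    "sample    pos    mutation",
    "2fec2     40     TC",
    "1f3c      40     TC",
    "19b0      40     TC",
    "tld3      60     CG",
]
per_sample = mutations_by_sample(table)
# Set intersection keeps only the pairs present in both samples
common = per_sample["2fec2"] & per_sample["1f3c"]
for pos, mutation in sorted(common):
    print(pos, mutation)
```

Because each sample maps to a set, a sample with several mutations is handled without overwriting earlier ones.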