How can I calculate the value of each split sequence from a dataframe/matrix using Python?

Time: 2018-03-29 19:06:02

Tags: python pandas dataframe matrix

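What are the ways to calculate the value of each split sequence from a dataframe/matrix using Python? In other words, how can I compute the score of a string from a given dataframe/matrix of scores? Example dataframe (df):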

    df

    #output
        v1  v2  v3  v4  v5  v6  v7  v8  v9  v10 v11 v12 v13 v14 v15
    A   3   3   1   1   3   2   1   3   3   2   3   3   3   1   2
    T   3   3   0   3   0   1   0   0   3   0   1   3   2   3   0
    G   1   1   3   1   1   2   1   3   1   0   1   3   2   1   2
    C   3   1   1   1   3   2   2   1   0   0   0   0   2   1   3
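
To try things out without a data file, the frame above can be rebuilt in code. This is a minimal sketch: it assumes the letters A, T, G, C are the row index and v1 to v15 are the column labels, which is how the printed output reads.

```python
import pandas as pd

# Score table copied from the printed output above.
df = pd.DataFrame(
    [
        [3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 3, 3, 3, 1, 2],  # A
        [3, 3, 0, 3, 0, 1, 0, 0, 3, 0, 1, 3, 2, 3, 0],  # T
        [1, 1, 3, 1, 1, 2, 1, 3, 1, 0, 1, 3, 2, 1, 2],  # G
        [3, 1, 1, 1, 3, 2, 2, 1, 0, 0, 0, 0, 2, 1, 3],  # C
    ],
    index=list("ATGC"),
    columns=[f"v{i}" for i in range(1, 16)],
)

print(df.shape)            # (4, 15)
print(df.loc["A", "v1"])   # 3, the score of letter A at position 1
print(df.loc["T", "v3"])   # 0, the score of letter T at position 3
```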


    #using jupyter notebook python 3
    seq = "ATGCGGCATTAT"
    def split_n(text, n):
        return [ text[i:i+n] for i in range(len(text)-(n-1)) ]

    # split seq by 5
    splited = split_n(seq,5)
    splited

    #output
    ['ATGCG', 'TGCGG', 'GCGGC', 'CGGCA', 'GGCAT', 'GCATT', 'CATTA', 'ATTAT']

    df.iloc[0,1]
    #output
    3

    # Something like this:
    # calculate the value of each split sequence
    # (col_val is the function I am looking for; it does not exist yet)
    vls = [col_val(splited, df, i) for i in range(len(splited))]
    vls

    #output should give
    [11, 7, 9, 9, 11, 10, 11, 12]

Background:

    # ATGCG = (A,1) + (T,2) + (G,3) + (C,4) + (G,5)
    #       = 3 + 3 + 3 + 1 + 1
    #       = 11

    # TGCGG = (T,1) + (G,2) + (C,3) + (G,4) + (G,5)
    #       = 3 + 1 + 1 + 1 + 1
    #       = 7

    # GCGGC = (G,1) + (C,2) + (G,3) + (G,4) + (C,5)
    #       = 1 + 1 + 3 + 1 + 3
    #       = 9

    # and so on
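
The rule in the background examples can be sketched as a small function. This is one possible reading, not a definitive implementation: `col_val` is the hypothetical name used in the question, its signature is assumed from the call shown there, and the score table is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# Score table from the question, rebuilt inline so the sketch is runnable.
df = pd.DataFrame(
    [
        [3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 3, 3, 3, 1, 2],  # A
        [3, 3, 0, 3, 0, 1, 0, 0, 3, 0, 1, 3, 2, 3, 0],  # T
        [1, 1, 3, 1, 1, 2, 1, 3, 1, 0, 1, 3, 2, 1, 2],  # G
        [3, 1, 1, 1, 3, 2, 2, 1, 0, 0, 0, 0, 2, 1, 3],  # C
    ],
    index=list("ATGC"),
    columns=[f"v{i}" for i in range(1, 16)],
)


def col_val(windows, scores, idx):
    """Score window number idx: letter j of the window is looked up in column j.

    Hypothetical helper; the name and signature are assumed from the question.
    """
    window = windows[idx]
    # Positions restart at v1 for every window, as in the background examples.
    return int(sum(scores.loc[letter].iloc[pos] for pos, letter in enumerate(window)))


splited = ["ATGCG", "TGCGG", "GCGGC"]
print([col_val(splited, df, i) for i in range(len(splited))])  # [11, 7, 9]
```

The three values match the hand calculations in the background section.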

1 answer:

Answer 0 (score: 0)

You can use list comprehensions:

    import pandas as pd

    # load the translation dataframe (row labels A/T/G/C, columns v1..v15)
    df = pd.read_csv("ATGCtranslate.csv", sep=r"\s+")
    # define the sequence and the window size for the summation
    seq = "ATGCGGCATTAT"
    win_size = 5
    # create all frames of window size
    seq_subl = [seq[i:i + win_size] for i in range(len(seq) - win_size + 1)]
    # sum the scores in each frame; positions restart at v1 in every frame
    # (.ix was removed in pandas 1.0, so use the row label plus the column position)
    seq_sums = [
        int(sum(df.loc[k].iloc[i] for i, k in enumerate(sub_list)))
        for sub_list in seq_subl
    ]
    # output
    # [11, 7, 9, 11, 4, 6, 12, 7]

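This differs from your expected output. But if you look at the last sequence, 'ATTAT', it should give 7 according to your sample dataframe, not 12.
It would be 12 if the number conversion started at position v8, but that is not how you calculate things in your background explanation. Nevertheless, a script that implements this version would be: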

    import pandas as pd

    # load the translation dataframe (row labels A/T/G/C, columns v1..v15)
    df = pd.read_csv("ATGCtranslate.csv", sep=r"\s+")
    # define the sequence and the window size for the summation
    seq = "ATGCGGCATTAT"
    win_size = 5
    # translate seq according to the dataframe: letter i is looked up in column i
    # (.ix was removed in pandas 1.0, so use the row label plus the column position)
    seq_transl = [int(df.loc[k].iloc[i]) for i, k in enumerate(seq)]
    # rolling sum with window size
    seq_sums = [sum(seq_transl[i:i + win_size]) for i in range(len(seq_transl) - win_size + 1)]
    # output
    # [11, 10, 9, 9, 11, 10, 11, 12]
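
The two results differ only in which columns a window is scored against. As a quick check of the last window, 'ATTAT', the sketch below scores it once starting at v1 and once starting at v8, which is its offset within `seq`. The helper `score_from` and its `start` parameter are assumptions made for illustration, and the table is rebuilt inline instead of being read from the CSV file.

```python
import pandas as pd

# Score table from the question, rebuilt inline so the sketch is runnable.
df = pd.DataFrame(
    [
        [3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 3, 3, 3, 1, 2],  # A
        [3, 3, 0, 3, 0, 1, 0, 0, 3, 0, 1, 3, 2, 3, 0],  # T
        [1, 1, 3, 1, 1, 2, 1, 3, 1, 0, 1, 3, 2, 1, 2],  # G
        [3, 1, 1, 1, 3, 2, 2, 1, 0, 0, 0, 0, 2, 1, 3],  # C
    ],
    index=list("ATGC"),
    columns=[f"v{i}" for i in range(1, 16)],
)


def score_from(window, scores, start=0):
    """Sum the scores of window, reading columns start, start + 1, ... (0-based)."""
    return int(sum(scores.loc[ch].iloc[start + j] for j, ch in enumerate(window)))


seq = "ATGCGGCATTAT"
last = seq[-5:]                        # 'ATTAT'
print(score_from(last, df))            # 7, scored against columns v1..v5
print(score_from(last, df, start=7))   # 12, scored against columns v8..v12
```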

There is probably a way to do this directly in pandas. I would tag the question with the keyword pandas to attract the right people.
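
As a possible starting point for a more direct approach, both variants can be written with array indexing. This is only a sketch under stated assumptions: it needs NumPy 1.20 or later for `sliding_window_view`, the variable names are illustrative, and the table is rebuilt inline instead of being read from the CSV file.

```python
import numpy as np
import pandas as pd

# Score table from the question, rebuilt inline so the sketch is runnable.
df = pd.DataFrame(
    [
        [3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 3, 3, 3, 1, 2],  # A
        [3, 3, 0, 3, 0, 1, 0, 0, 3, 0, 1, 3, 2, 3, 0],  # T
        [1, 1, 3, 1, 1, 2, 1, 3, 1, 0, 1, 3, 2, 1, 2],  # G
        [3, 1, 1, 1, 3, 2, 2, 1, 0, 0, 0, 0, 2, 1, 3],  # C
    ],
    index=list("ATGC"),
    columns=[f"v{i}" for i in range(1, 16)],
)

seq = "ATGCGGCATTAT"
win_size = 5

# Row number of each letter of seq in the table (A=0, T=1, G=2, C=3).
rows = df.index.get_indexer(list(seq))
scores = df.to_numpy()

# Variant 1: positions restart at v1 in every window.
windows = np.lib.stride_tricks.sliding_window_view(rows, win_size)
per_window = scores[windows, np.arange(win_size)].sum(axis=1)
print(per_window.tolist())  # [11, 7, 9, 11, 4, 6, 12, 7]

# Variant 2: translate the whole sequence once (letter i uses column i),
# then take a rolling sum over the translated values.
translated = pd.Series(scores[rows, np.arange(len(seq))])
rolling = translated.rolling(win_size).sum().dropna().astype(int)
print(rolling.tolist())  # [11, 10, 9, 9, 11, 10, 11, 12]
```

Variant 1 pairs each window's row numbers with columns 0 to 4 through broadcasting; variant 2 relies on `Series.rolling`, which leaves the first `win_size - 1` entries as NaN, hence the `dropna()`.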