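What are the ways to compute the value of each split (sliding-window) subsequence from a DataFrame / matrix in Python? In other words, how can I compute the score of a string from a given score DataFrame / matrix? Example DataFrame (df):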
df
#output
   v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  v11  v12  v13  v14  v15
A   3   3   1   1   3   2   1   3   3    2    3    3    3    1    2
T   3   3   0   3   0   1   0   0   3    0    1    3    2    3    0
G   1   1   3   1   1   2   1   3   1    0    1    3    2    1    2
C   3   1   1   1   3   2   2   1   0    0    0    0    2    1    3
# using Jupyter notebook, Python 3
seq = "ATGCGGCATTAT"
def split_n(text, n):
    # all overlapping substrings of length n
    return [text[i:i+n] for i in range(len(text) - (n - 1))]
# split seq by 5
splited = split_n(seq,5)
splited
#output
['ATGCG', 'TGCGG', 'GCGGC', 'CGGCA', 'GGCAT', 'GCATT', 'CATTA', 'ATTAT']
df.iloc[0,1]
#output
0
# Something like this
# calculate the value of each split sequence (col_val is the function I am looking for)
vls = [col_val(splited, df, _) for _ in range(len(splited))]
vls
# output should give
[11, 7, 9, 9, 11, 10, 11, 12]
Background:
# ATGCG = (A,1) + (T,2) + (G,3) + (C,4) + (G,5)
# i.e.  = (3) + (3) + (3) + (1) + (1)
#       = 11
# TGCGG = (T,1) + (G,2) + (C,3) + (G,4) + (G,5)
# i.e.  = (3) + (1) + (1) + (1) + (1)
#       = 7
# GCGGC = (G,1) + (C,2) + (G,3) + (G,4) + (C,5)
# i.e.  = (1) + (1) + (3) + (1) + (3)
#       = 9
# and so on
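The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the `SCORES` dict and the `window_terms` helper are names made up for illustration, and the dict holds just columns v1 to v5 of the example table, which is all a window of length 5 needs.

```python
# Columns v1..v5 of the example score table, keyed by letter.
# (The dict layout is an illustrative assumption, not part of the original post.)
SCORES = {
    "A": [3, 3, 1, 1, 3],
    "T": [3, 3, 0, 3, 0],
    "G": [1, 1, 3, 1, 1],
    "C": [3, 1, 1, 1, 3],
}


def window_terms(window):
    """Per-position scores: the letter at position i is looked up in column i."""
    return [SCORES[base][pos] for pos, base in enumerate(window)]


for window in ["ATGCG", "TGCGG", "GCGGC"]:
    terms = window_terms(window)
    print(window, terms, sum(terms))
# ATGCG [3, 3, 3, 1, 1] 11
# TGCGG [3, 1, 1, 1, 1] 7
# GCGGC [1, 1, 3, 1, 3] 9
```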
Answer 0 (score: 0)
You can use list comprehensions:
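import pandas as pd
# load the translation DataFrame (the row labels A, T, G, C are assumed to end up as the index)
df = pd.read_csv("ATGCtranslate.csv", sep=r"\s+")
# define the sequence and the window size for the summation
seq = "ATGCGGCATTAT"
win_size = 5
# create frames of window size
seq_subl = [seq[i:i + win_size] for i in range(len(seq) - win_size + 1)]
# sum for each frame: letter k at position i of a frame is looked up in row k, column i
# (.ix was removed in pandas 1.0, so the label and position lookups are split)
seq_sums = [int(sum(df.loc[k].iloc[i] for i, k in enumerate(sub_list))) for sub_list in seq_subl]
# output
# [11, 7, 9, 11, 4, 6, 12, 7]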
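For longer sequences the same per-frame lookup could be vectorized with NumPy. The following is a rough sketch, not a tested production solution: the table is rebuilt inline from the values in the question so the snippet runs without the CSV file, `window_scores` is a hypothetical helper name, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
import pandas as pd

# Example score table rebuilt inline (values copied from the question).
df = pd.DataFrame(
    [
        [3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 3, 3, 3, 1, 2],
        [3, 3, 0, 3, 0, 1, 0, 0, 3, 0, 1, 3, 2, 3, 0],
        [1, 1, 3, 1, 1, 2, 1, 3, 1, 0, 1, 3, 2, 1, 2],
        [3, 1, 1, 1, 3, 2, 2, 1, 0, 0, 0, 0, 2, 1, 3],
    ],
    index=list("ATGC"),
    columns=[f"v{i}" for i in range(1, 16)],
)


def window_scores(seq, df, win_size):
    """Hypothetical helper: score every window against columns v1..v<win_size>."""
    scores = df.to_numpy()
    # Row number of each letter of seq in the table (A -> 0, T -> 1, ...).
    rows = df.index.get_indexer(list(seq))
    if (rows < 0).any():
        raise ValueError("seq contains a letter that is not in the table")
    # windows[j] holds the row numbers of the j-th overlapping window.
    windows = np.lib.stride_tricks.sliding_window_view(rows, win_size)
    # Pair the p-th letter of every window with column p, then sum per window.
    return scores[windows, np.arange(win_size)].sum(axis=1).tolist()


print(window_scores("ATGCGGCATTAT", df, 5))
# [11, 7, 9, 11, 4, 6, 12, 7]
```

It prints the same list as the list comprehension above, since every window is scored against columns v1 to v5 again.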
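This differs from your expected output, but if you look at the last sequence, 'ATTAT', it should give 7 rather than 12 according to your sample DataFrame.
It would be 12 if the numeric translation for that window started at position v8, but that is not how you calculate things in your background explanation. Nevertheless, a script implementing this version would be: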
import pandas as pd
# load the translation DataFrame (the row labels A, T, G, C are assumed to end up as the index)
df = pd.read_csv("ATGCtranslate.csv", sep=r"\s+")
# define the sequence and the window size for the summation
seq = "ATGCGGCATTAT"
win_size = 5
# translate seq according to the DataFrame: the i-th letter is looked up in the i-th column
# (.ix was removed in pandas 1.0, so the label and position lookups are split)
seq_transl = [int(df.loc[k].iloc[i]) for i, k in enumerate(seq)]
# rolling sum with window size
seq_sums = [sum(seq_transl[i:i + win_size]) for i in range(len(seq_transl) - win_size + 1)]
# output
# [11, 10, 9, 9, 11, 10, 11, 12]
There may well be a way to do this directly in pandas. I would add the pandas tag to the question to attract the right people.
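One possible pandas-based take on the second version, offered as a sketch rather than a definitive answer: translate the whole sequence once, then let `Series.rolling` do the windowed sum. The table is rebuilt inline so the snippet runs without the CSV file, and it assumes that seq is no longer than the number of columns (15 here) and contains only letters present in the index.

```python
import numpy as np
import pandas as pd

# Example score table rebuilt inline (values copied from the question).
df = pd.DataFrame(
    [
        [3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 3, 3, 3, 1, 2],
        [3, 3, 0, 3, 0, 1, 0, 0, 3, 0, 1, 3, 2, 3, 0],
        [1, 1, 3, 1, 1, 2, 1, 3, 1, 0, 1, 3, 2, 1, 2],
        [3, 1, 1, 1, 3, 2, 2, 1, 0, 0, 0, 0, 2, 1, 3],
    ],
    index=list("ATGC"),
    columns=[f"v{i}" for i in range(1, 16)],
)

seq = "ATGCGGCATTAT"
win_size = 5

# Score of the i-th letter of seq, taken from the i-th column (v1, v2, ...).
rows = df.index.get_indexer(list(seq))
per_letter = pd.Series(df.to_numpy()[rows, np.arange(len(seq))])

# Rolling sum; the first win_size - 1 entries are NaN and get dropped.
seq_sums = per_letter.rolling(win_size).sum().dropna().astype(int).tolist()
print(seq_sums)
# [11, 10, 9, 9, 11, 10, 11, 12]
```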