Get frequency based on two columns

Time: 2019-02-26 20:08:13

Tags: r dplyr tibble

A snippet of my large data frame looks like this:

MARKERS.IN.HAPLOTYPES BASE           rs. alleles chrom       pos        GID marker   trial
                 1A.12    C S1A_494392059     C/G    1A 494392059 GID7173723      2 ES26-38
                 1A.13    C S1A_497201550     C/T    1A 497201550 GID7173723      0 ES26-38
                 1A.14    T S1A_499864157     C/T    1A 499864157 GID7173723      2 ES26-38
                 1B.10    A S1B_566171302     G/A    1B 566171302 GID7173723      0 ES26-38
                 1B.20    G S1B_642616640     A/G    1B 642616640 GID7173723      2 ES26-38
                 2B.10    A  S2B_24883552     A/G    2B  24883552 GID7173723      2 ES26-38

Here is the dput of it:

structure(list(MARKERS.IN.HAPLOTYPES = c("1A.12", "1A.13", "1A.14", 
"1B.10", "1B.20", "2B.10"), BASE = c("C", "C", "T", "A", "G", 
"A"), rs. = c("S1A_494392059", "S1A_497201550", "S1A_499864157", 
"S1B_566171302", "S1B_642616640", "S2B_24883552"), alleles = c("C/G", 
"C/T", "C/T", "G/A", "A/G", "A/G"), chrom = c("1A", "1A", "1A", 
"1B", "1B", "2B"), pos = c(494392059L, 497201550L, 499864157L, 
566171302L, 642616640L, 24883552L), GID = c("GID7173723", "GID7173723", 
"GID7173723", "GID7173723", "GID7173723", "GID7173723"), marker = c("2", 
 "0", "2", "0", "2", "2"), trial = c("ES26-38", "ES26-38", "ES26-38", 
 "ES26-38", "ES26-38", "ES26-38")), row.names = c(NA, 6L), class = 
 "data.frame")

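In the original data frame, column rs. has 22 unique values and column trial has six unique values. I would like to calculate the relative frequency of the different values of column marker for every unique rs. and every unique trial. So, for example, the first entry, for rs. S1A_494392059, would have the frequencies of the marker column for trial ES26-38, and so on. Note that column marker is a character vector, not numeric.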

1 Answer:

Answer 0: (score: 1)

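You could try counting each rs./trial/marker combination with add_count and dividing by the number of rows in each rs./trial group.

The name argument of add_count is a new feature as of dplyr 0.8 that lets you choose the name of the count column (previously it was n, or nn, by default). If you do not have a recent version of the package, code that uses name will not work.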
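As a minimal sketch of how add_count with name can feed a per-row relative frequency, the snippet below uses mutate so that every original row is kept. The toy data frame is hypothetical (four rows for one rs./trial pair), and the mutate variant is an assumption for illustration, not necessarily the exact code the answer intended.

```r
library(dplyr)  # assumes dplyr >= 0.8, which provides the `name` argument

# Hypothetical toy data: one rs./trial pair observed in four rows
toy <- data.frame(
  rs.    = rep("S1A_494392059", 4),
  trial  = rep("ES26-38", 4),
  marker = c("2", "2", "2", "0"),
  stringsAsFactors = FALSE
)

freq <- toy %>%
  # number of rows for each rs./trial/marker combination
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  # divide by the number of rows in the rs./trial group
  mutate(RelativeFreq = round(MarkerTotal / n(), 2)) %>%
  ungroup()

print(freq)
```

Here marker "2" accounts for three of the four rows and marker "0" for one, so the two values of RelativeFreq are 0.75 and 0.25.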

In your example the relative frequency is 1 everywhere, because the sample is not particularly complex.

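If you would like a summarised data frame (the only columns left will be the grouping columns rs. and trial, plus RelativeFreq), this is what you can do:

df %>%
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  summarise(RelativeFreq = round(MarkerTotal / n(), 2))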
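If what is wanted is one row per rs./trial/marker combination, with the marker value kept in the result, a possible alternative is to use count instead of add_count. This is a sketch under that assumption and is not the answer's own code; the toy data frame is again hypothetical.

```r
library(dplyr)  # assumes dplyr >= 0.8 for the `name` argument

# Hypothetical toy data: one rs./trial pair observed in four rows
toy <- data.frame(
  rs.    = rep("S1A_494392059", 4),
  trial  = rep("ES26-38", 4),
  marker = c("2", "2", "2", "0"),
  stringsAsFactors = FALSE
)

summary_freq <- toy %>%
  # collapse to one row per rs./trial/marker combination
  count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  # share of each marker value within its rs./trial group
  mutate(RelativeFreq = round(MarkerTotal / sum(MarkerTotal), 2)) %>%
  ungroup()

print(summary_freq)
```

Because count collapses the rows first, the denominator is sum(MarkerTotal) rather than n(), which at that point would only count the distinct marker values in the group.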
