A fragment of my large data frame looks like this:
MARKERS.IN.HAPLOTYPES  BASE  rs.            alleles  chrom  pos        GID         marker  trial
1A.12                  C     S1A_494392059  C/G      1A     494392059  GID7173723  2       ES26-38
1A.13                  C     S1A_497201550  C/T      1A     497201550  GID7173723  0       ES26-38
1A.14                  T     S1A_499864157  C/T      1A     499864157  GID7173723  2       ES26-38
1B.10                  A     S1B_566171302  G/A      1B     566171302  GID7173723  0       ES26-38
1B.20                  G     S1B_642616640  A/G      1B     642616640  GID7173723  2       ES26-38
2B.10                  A     S2B_24883552   A/G      2B     24883552   GID7173723  2       ES26-38
Here is the dput of it:
structure(list(MARKERS.IN.HAPLOTYPES = c("1A.12", "1A.13", "1A.14",
"1B.10", "1B.20", "2B.10"), BASE = c("C", "C", "T", "A", "G",
"A"), rs. = c("S1A_494392059", "S1A_497201550", "S1A_499864157",
"S1B_566171302", "S1B_642616640", "S2B_24883552"), alleles = c("C/G",
"C/T", "C/T", "G/A", "A/G", "A/G"), chrom = c("1A", "1A", "1A",
"1B", "1B", "2B"), pos = c(494392059L, 497201550L, 499864157L,
566171302L, 642616640L, 24883552L), GID = c("GID7173723", "GID7173723",
"GID7173723", "GID7173723", "GID7173723", "GID7173723"), marker = c("2",
"0", "2", "0", "2", "2"), trial = c("ES26-38", "ES26-38", "ES26-38",
"ES26-38", "ES26-38", "ES26-38")), row.names = c(NA, 6L), class =
"data.frame")
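In the original data frame, the rs. column has 22 unique values and the trial column has six unique values. For each unique rs. value and each unique trial value, I want to compute the relative frequencies of the different values of the marker column. So, for example, the first item of the rs. column, S1A_494392059, would have the frequencies of the marker column for trial ES26-38, and so on. Note that the marker column is a character vector, not numeric.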
Answer 0 (score: 1)
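You can try an approach based on add_count from dplyr.

The name argument of add_count is a new feature as of dplyr 0.8 that lets you choose the name of the count column (previously it was n, or nn, by default). If you don't have an up-to-date version of the package, code that passes name will not work.

In your example the relative frequency is 1 everywhere, though, since the sample is not particularly complex.

If you want a summarised data frame (the only columns left will be rs., trial and RelativeFreq, because the data are grouped by rs. and trial), this is what you could do:

df %>%
  # add a MarkerTotal column: number of rows per rs./trial/marker combination
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  # divide by the number of rows in each rs./trial group
  summarise(RelativeFreq = round(MarkerTotal / n(), 2))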
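As a variant that keeps the marker value next to its frequency, here is a minimal sketch using count() followed by a grouped mutate(). The toy data frame and the name rel_freq are assumptions made up for illustration (they are not from the question), and count(..., name = ) needs dplyr 0.8 or later, like add_count. It may also help when an rs./trial group has more than one row, where a summarise() expression that returns one value per row either errors or yields repeated rows, depending on the dplyr version.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the question): two rs. values,
# one trial, with repeated marker calls so the frequencies are not all 1.
toy <- data.frame(
  rs.    = c("S1", "S1", "S1", "S1", "S2", "S2"),
  trial  = c("T1", "T1", "T1", "T1", "T1", "T1"),
  marker = c("0", "2", "2", "2", "0", "0"),
  stringsAsFactors = FALSE
)

rel_freq <- toy %>%
  # one row per rs./trial/marker combination, with its row count
  count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  # share of each marker value within its rs./trial group
  mutate(RelativeFreq = round(MarkerTotal / sum(MarkerTotal), 2)) %>%
  ungroup()

print(rel_freq)
```

Because marker is only used as a grouping key here, it does not matter that it is a character vector rather than numeric.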