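SQLAlchemy: difference between iterating and counting versus the func.count() result

Date: 2017-06-30 10:17:36

Tags: python sqlalchemy

I have a set of classes defined like this (note: they extend an existing database via automap, so the columns referenced below may not all appear in the classes shown here):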

class VariantAssociation(Base):

    __tablename__ = "sample_variant_association"

    vid = Column(Integer, ForeignKey("variants.variant_id"),
                primary_key=True, index=True)
    sid = Column(Integer, ForeignKey("samples.sample_id"),
                primary_key=True, index=True)

    vdepth = Column(Integer, index=True)
    valt_depth = Column(Integer, index=True)
    gt = Column(Text)
    gt_type = Column(Integer)
    fraction = Column(Float, index=True)

    variant = relationship("Variant", back_populates="samples")
    sample = relationship("Samples", back_populates="variants")


class Variant(Base):

    __tablename__ = "variants"

    variant_id = Column(Integer, primary_key=True)
    info = deferred(Column(LargeBinary))

    samples = relationship("VariantAssociation",
                        back_populates="variant")

    def __repr__(self):

        data = "<Variant {chrom}:{start}-{end} {gene} {ref}/{alt} {type}>"

        return data.format(chrom=self.chrom,
                        start=self.start,
                        end=self.end,
                        gene=self.gene,
                        ref=self.ref,
                        alt=self.alt,
                        type=self.type)


class Samples(Base):

    __tablename__ = "samples"

    sample_id = Column(Integer, primary_key=True, index=True)
    name = Column(Text, index=True)
    variants = relationship("VariantAssociation",
                            back_populates="sample")

They are combined in a fairly complex query, shown here in simplified form:

query = session.query(Variant).join(VariantAssociation).join(Samples)
query = query.filter(VariantAssociation.vdepth >= 60)

Now I want to count the combinations of two columns: ref and alt.

I thought this would be simple:

query = query.with_entities(Variant.ref, Variant.alt,
    func.count()).distinct().group_by(Variant.ref, Variant.alt)

which yields (one example row):

('A', 'C', 308)

However, if I simply iterate over the query and count:

from collections import defaultdict, Counter
counts  = defaultdict(Counter)
for row in query.with_entities(Variant.ref, Variant.alt):
    counts[f"{row.ref}>{row.alt}"].update(["present"])

this gives me

'A>C': Counter({'present': 155})

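which is almost half of what I get with func.count(). I know the iterated count is correct and the func.count() result is not. But I would like to use func.count(), because iterating can be very slow (large SQLite database).

Am I setting up the count incorrectly?

Edit: as requested, the full query for the count (including more filters from the database itself):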

SELECT DISTINCT variants.ref AS variants_ref, variants.alt AS variants_alt, count(*) AS count_1 
FROM variants JOIN sample_variant_association ON variants.variant_id = sample_variant_association.vid JOIN
samples ON samples.sample_id = sample_variant_association.sid 
WHERE sample_variant_association.gt_type != ? AND variants.impact NOT IN (?, ?, ?, ?) AND
sample_variant_association.vdepth >= ? AND sample_variant_association.fraction >= ? AND variants.chrom NOT IN (?,
?) AND variants.aaf_1kg_eur < ? AND variants.type = ? AND sample_variant_association.fraction >= ? AND
sample_variant_association.vdepth >= ? GROUP BY variants.ref, variants.alt

The one used for iteration:

SELECT DISTINCT variants.ref AS variants_ref, variants.alt AS variants_alt
FROM variants JOIN sample_variant_association ON variants.variant_id = sample_variant_association.vid JOIN
samples ON samples.sample_id = sample_variant_association.sid 
WHERE sample_variant_association.gt_type != ? AND variants.impact NOT IN (?, ?, ?, ?) AND
sample_variant_association.vdepth >= ? AND sample_variant_association.fraction >= ? AND variants.chrom NOT IN (?,
?) AND variants.aaf_1kg_eur < ? AND variants.type = ? AND sample_variant_association.fraction >= ? AND
sample_variant_association.vdepth >= ?

Edit 2: I traced this back by checking whether the base query contains duplicate variant_ids:

query.with_entities(Variant.variant_id).count()
18288
query.with_entities(Variant.variant_id).distinct().count()
14437

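So the problem is different from what I originally thought. Somehow, the duplicate records are taken into account in the loop but not in func.count().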
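
One way such a gap can arise is sketched below with the standard-library sqlite3 module. The tables are cut down to a few columns and the rows are invented for illustration, so this is an assumption about the mechanism, not a reproduction of the real database. In the sketch, count(*) counts joined rows (one per matching sample), and DISTINCT is applied to the aggregated output rows, so it does not remove the duplicated variants before they are counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, ref TEXT, alt TEXT);
    CREATE TABLE sample_variant_association (vid INTEGER, sid INTEGER, vdepth INTEGER);
    INSERT INTO variants VALUES (1, 'A', 'C'), (2, 'A', 'C'), (3, 'G', 'T');
    -- Invented data: variant 1 passes the depth filter in two samples,
    -- variants 2 and 3 in one sample each.
    INSERT INTO sample_variant_association VALUES
        (1, 10, 80), (1, 11, 90), (2, 10, 70), (3, 11, 65), (3, 10, 20);
""")

# Shared FROM/JOIN/WHERE part, modelled on the query in the question.
JOIN = """
    FROM variants
    JOIN sample_variant_association
      ON variants.variant_id = sample_variant_association.vid
    WHERE sample_variant_association.vdepth >= 60
"""

# count(*) counts every joined row, so variant 1 is counted twice.
per_row = conn.execute(
    "SELECT DISTINCT ref, alt, count(*)" + JOIN + "GROUP BY ref, alt ORDER BY ref"
).fetchall()

# count(DISTINCT variant_id) counts each variant once per (ref, alt) group.
per_variant = conn.execute(
    "SELECT ref, alt, count(DISTINCT variants.variant_id)"
    + JOIN
    + "GROUP BY ref, alt ORDER BY ref"
).fetchall()

print(per_row)      # [('A', 'C', 3), ('G', 'T', 1)]
print(per_variant)  # [('A', 'C', 2), ('G', 'T', 1)]
```

If the real data behaves the same way, the larger number would be the number of variant/sample pairs that pass the filters, and the smaller one the number of distinct variants.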

1 answer:

Answer 0 (score: 0)

Using a subquery works. First remove the duplicates:

id_subquery = query.with_entities(Variant.variant_id).distinct().subquery()

Then fetch the actual data:

c_query = session.query(Variant.ref, Variant.alt, func.count(1))
c_query = c_query.filter(Variant.variant_id.in_(id_subquery))
c_query = c_query.group_by(Variant.ref, Variant.alt)

c_query.first()
('A', 'C', 155)
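
As a possible alternative to the subquery, the duplicates could also be collapsed inside the aggregate itself by counting distinct variant_id values. The following is a minimal, self-contained sketch that assumes SQLAlchemy 1.4 or newer, an in-memory SQLite database, and a reduced, invented schema and data set (the ref and alt columns are declared explicitly here instead of coming from automap). It has not been checked against the real database.

```python
from sqlalchemy import Column, ForeignKey, Integer, Text, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Variant(Base):
    __tablename__ = "variants"
    variant_id = Column(Integer, primary_key=True)
    ref = Column(Text)  # declared explicitly for this sketch
    alt = Column(Text)


class VariantAssociation(Base):
    __tablename__ = "sample_variant_association"
    vid = Column(Integer, ForeignKey("variants.variant_id"), primary_key=True)
    sid = Column(Integer, primary_key=True)
    vdepth = Column(Integer)


engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Invented data: variant 1 is seen in two samples, variant 2 in one.
    session.add_all([
        Variant(variant_id=1, ref="A", alt="C"),
        Variant(variant_id=2, ref="A", alt="C"),
        VariantAssociation(vid=1, sid=10, vdepth=80),
        VariantAssociation(vid=1, sid=11, vdepth=90),
        VariantAssociation(vid=2, sid=10, vdepth=70),
    ])
    session.commit()

    # COUNT(DISTINCT variant_id) ignores the extra rows produced by the join.
    rows = (
        session.query(Variant.ref, Variant.alt,
                      func.count(Variant.variant_id.distinct()))
        .join(VariantAssociation, VariantAssociation.vid == Variant.variant_id)
        .filter(VariantAssociation.vdepth >= 60)
        .group_by(Variant.ref, Variant.alt)
        .all()
    )
    result = [tuple(row) for row in rows]

print(result)  # [('A', 'C', 2)]
```

Whether this or the subquery is faster on a large SQLite file would have to be measured; the sketch only shows that both ideas count each variant once.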