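Complex pandas subsetting: selecting rows that match conditions across many columns

Time: 2018-11-15 12:21:55

Tags: python pandas

I am selecting data from a pandas DataFrame of about 1.5 million rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has the mutation, 0.0 means the sample does not have that mutation, and 0.5 means there is no data for that mutation in the sample.

The samples come from one of three tissue types, called AE, BE and HE. The samples fall into the following categories: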

AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]

I have written the following pandas queries. They all work, but they are clumsy:

-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE

The code is as follows:

"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) | 
        (df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]


"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) | 
        (df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) & 
 ((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) & 
 ((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] == 1.0) | (df["BE9.13"] == 1.0))]

"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
        (df["BE1.8"] == 1.0) | (df["BE2"] == 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
 ((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
 ((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) | 
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]

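This works fine, but it looks clumsy and is not very elegant, and if I need to change something (such as a sample name) it takes a long time to rewrite. Can anyone help me with a simpler way to write this query? I was wondering whether there is a way to pass each list along with a criterion, something like this: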

AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]

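I find that I quite regularly need to subset rows based on multiple columns, where the columns can be grouped like this, so I would be grateful if anyone could make this kind of query more concise. Many thanks.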

Edit: a minimal example, as requested:

mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]

import string
import pandas as pd
m_list=[x for x in string.ascii_lowercase[:10]]

df=pd.DataFrame(columns=['AE1','AE2','BE1','BE2','HE1','HE2']) 
for m,n in zip(m_list, mutations):
    df.loc[m]=n

AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']

"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE1"] != 0.5) | (df["BE2"] != 0.5))]

"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE1"] != 0.5) | (df["HE2"] != 0.5))]

"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE1"] != 0.5) | (df["AE2"] != 0.5))]

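This shows an extremely simplified example of the problem. Multiple conditions are used to select a subset of df: I want to apply one kind of condition across a whole group of columns and another kind across a different group of columns, but this gets very messy once you have, say, 10 columns. The first example above is the more realistic one and, as has already been pointed out, it is almost unreadable, which is exactly my point. Is there a neater way to write this kind of complex query/subset, where multiple columns need the same selection operation performed on them? I would appreciate any help.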

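1 Answer:

Answer 0 (score: 1)

eq + any / all + loc

For a vectorized approach, you can subset the dataframe by column list and combine an equality test with any / all operations: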

# Get variants in AE and in HE but not in BE

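m1 = df[AE].eq(1.0).any(axis=1)  # at least one AE sample has the mutation
m2 = df[HE].eq(1.0).any(axis=1)  # at least one HE sample has the mutation
m3 = df[BE].eq(0).all(axis=1)    # every BE sample is 0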

df_filtered = df.loc[m1 & m2 & m3]
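The same three masks cover the other two queries once the column lists are swapped, so the pattern can be wrapped in a small helper. The following is only a sketch built on the question's minimal example. The names `groups`, `present_in`, `absent_from` and `select` are hypothetical and not part of pandas, and the frame is constructed directly from the nested list rather than row by row.

```python
import pandas as pd

# Minimal example data from the question (rows a..j, six samples).
mutations = [[1, 1, 0, 0, 0.5, 0],
             [1, 0, 0, 0, 1, 0],
             [1, 1, 0, 0.5, 0, 0],
             [0, 0.5, 0, 1, 0, 1],
             [0, 1, 0, 0, 0, 0],
             [1, 0, 0, 0, 0, 0],
             [1, 0, 1, 0, 1, 0],
             [0, 0, 0, 1, 0.5, 1],
             [0, 1, 1, 1, 0, 0],
             [1, 0.5, 0, 1, 0, 0]]
columns = ['AE1', 'AE2', 'BE1', 'BE2', 'HE1', 'HE2']
df = pd.DataFrame(mutations, index=list("abcdefghij"), columns=columns, dtype=float)

# Hypothetical mapping from tissue name to its sample columns.
groups = {"AE": ['AE1', 'AE2'], "BE": ['BE1', 'BE2'], "HE": ['HE1', 'HE2']}

def present_in(df, cols):
    """True for rows where at least one of the given columns equals 1.0."""
    return df[cols].eq(1.0).any(axis=1)

def absent_from(df, cols):
    """True for rows where every one of the given columns equals 0.0."""
    return df[cols].eq(0.0).all(axis=1)

def select(df, groups, present, absent):
    """Rows present in every tissue in `present` and absent from every tissue in `absent`."""
    mask = pd.Series(True, index=df.index)
    for name in present:
        mask &= present_in(df, groups[name])
    for name in absent:
        mask &= absent_from(df, groups[name])
    return df.loc[mask]

AE_HE_notBE = select(df, groups, present=["AE", "HE"], absent=["BE"])
AE_BE_notHE = select(df, groups, present=["AE", "BE"], absent=["HE"])
BE_HE_notAE = select(df, groups, present=["BE", "HE"], absent=["AE"])

print(list(AE_HE_notBE.index))  # ['b']
print(list(AE_BE_notHE.index))  # ['i', 'j']
print(list(BE_HE_notAE.index))  # ['h']
```

Renaming or adding a sample then only means editing the lists in `groups`; the query lines stay unchanged.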

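If, as you describe, all values are 0, 0.5 or 1.0, then requiring that the selected values be neither 1.0 nor 0.5 is the same as requiring that they be 0.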
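One caveat, offered as a reading of the question rather than a correction: the question's pseudocode asks for "no sample equal to 1.0 and at least one sample equal to 0.0", which is looser than "all samples equal to 0", because a group such as (0, 0.5) passes the first test but not the second. A minimal sketch comparing the two readings on the minimal example follows. The helper names `absent_strict` and `absent_lenient` are hypothetical.

```python
import pandas as pd

# Minimal example data from the question (rows a..j, six samples).
mutations = [[1, 1, 0, 0, 0.5, 0],
             [1, 0, 0, 0, 1, 0],
             [1, 1, 0, 0.5, 0, 0],
             [0, 0.5, 0, 1, 0, 1],
             [0, 1, 0, 0, 0, 0],
             [1, 0, 0, 0, 0, 0],
             [1, 0, 1, 0, 1, 0],
             [0, 0, 0, 1, 0.5, 1],
             [0, 1, 1, 1, 0, 0],
             [1, 0.5, 0, 1, 0, 0]]
columns = ['AE1', 'AE2', 'BE1', 'BE2', 'HE1', 'HE2']
df = pd.DataFrame(mutations, index=list("abcdefghij"), columns=columns, dtype=float)

AE = ['AE1', 'AE2']
BE = ['BE1', 'BE2']
HE = ['HE1', 'HE2']

def absent_strict(df, cols):
    """Every column in the group is 0.0 (the answer's m3 condition)."""
    return df[cols].eq(0.0).all(axis=1)

def absent_lenient(df, cols):
    """No column is 1.0 and at least one column is 0.0 (the question's pseudocode)."""
    block = df[cols]
    return ~block.eq(1.0).any(axis=1) & block.eq(0.0).any(axis=1)

in_BE = df[BE].eq(1.0).any(axis=1)
in_HE = df[HE].eq(1.0).any(axis=1)

# "In BE and in HE but not in AE" under each reading of "not in AE".
strict = df.loc[in_BE & in_HE & absent_strict(df, AE)]
lenient = df.loc[in_BE & in_HE & absent_lenient(df, AE)]

print(list(strict.index))   # ['h']
print(list(lenient.index))  # ['d', 'h']
```

Row d has AE values (0, 0.5), so it is kept only by the lenient test. Which reading is appropriate depends on whether a missing observation (0.5) should rule a variant out.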