我有一个类似于以下内容的MultiIndex Pandas DataFrame:
import pandas as pd
import numpy as np
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
我想过滤任何行的df,其中至少有一个样本(在这种情况下为Sample1或Sample2)具有DP> = 50& GQ< 4.在这些条件下,除第一行外,所有行都应过滤掉。
我不知道从哪里开始,并希望得到一些帮助。
编辑:
在jezrael的帖子的帮助下,我找到了解决方案。代码如下:
genotype = df.ix[:,3:]
DP = genotype.xs('DP', axis=1, level=1)
GQ = genotype.xs('GQ', axis=1, level=1)
conditions = (DP.ge(50) & GQ.le(4)).T.any()
df = df[conditions]
return df
答案 0 :(得分:2)
我认为你可以使用:
#data in sample change for matching (first 99 in HQ in Sample1 was changed to 3)
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
print (df)
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
0 12 100 0/1 120,60 180 5 0.50 0/1 200,2 202 99 0.01
1 23 200 0/1 200,20 60 3 0.10 0/1 200,50 250 99 0.40
2 40 40 0/1 200,2 202 99 0.01 0/1 200,2 202 99 0.01
df1 = df.filter(like='Sample')
df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(50) &
df1.xs('GQ', axis=1, level=1).lt(4)).any(1).idxmax()]]
print (df)
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
1 23 200 0/1 200,20 60 3 0.1 0/1 200,50 250 99 0.4
编辑:
如果需要按条件返回所有行,请删除loc
和idmax
:
df1 = df.filter(like='Sample')
#changed condition to lt(10) (<10)
df = df[(df1.xs('DP', axis=1, level=1).ge(50) & df1.xs('GQ', axis=1, level=1).lt(10)).any(1)]
print (df)
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
0 12 100 0/1 120,60 180 5 0.5 0/1 200,2 202 99 0.01
1 23 200 0/1 200,20 60 3 0.1 0/1 200,50 250 99 0.40
答案 1 :(得分:1)
stack
第一级,并使用query
来识别索引
df.loc[df.stack(0).query('DP >= 50 & GQ < 4').unstack().index]
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
1 23 200 0/1 200,20 60 3 0.1 0/1 200,50 250 99 0.4
我使用了@ jezrael的设置
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)