I have a df in the format below, with ~70000 columns and 540 rows. All values are 0.0, 0.5 or 1.0.
VAR 1_139632_G 1_158006_T 1_172595_A 1_564650_A 1_564652_G \
SRR4216489 0.5 0.5 0.5 0.5 0.5
SRR4216786 0.5 0.5 0.5 0.5 0.5
SRR4216628 0.5 0.0 1.0 0.0 0.0
SRR4216456 0.5 0.5 0.5 0.5 0.5
SRR4216393 0.5 0.5 0.5 0.5 0.5
I want to drop every column in which the number of '0.5' values is only 1 less than the number of rows. So far I have tried:
total_samples = len(df.index) # Gets the number of rows
df_col_05 = df[df == 0.5].count() # returns a Series of column-wise counts of 0.5
df_col_05 = df_col_05.where(df_col_05 < (total_samples-1)) # replaces with NaN where the condition isn't met
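What I want is my original df with all columns removed where the value in df_col_05 is >= (total_samples-1), so basically drop the columns where 'df_col_05' has a NaN, but I can't work out how to do this.
I'm sure this should be easy for someone with more pandas experience than me (I started a few days ago).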
Answer 0 (score: 4)
You can use boolean indexing with loc to filter columns; it is better to use sum to get the count of True values in each column of the boolean DataFrame:
#if first column is not index set it
df = df.set_index('VAR')
df1 = df.loc[:, (df == 0.5).sum() >= len(df.index)-1]
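Note that the mask above selects the columns whose 0.5 count is at least len(df.index)-1, whereas the question asks for those columns to be dropped. A minimal sketch of the inverted selection follows; the toy frame mirrors the sample data used in this answer, and the names mostly_half and df_dropped are made up for illustration.

```python
import pandas as pd

# Toy frame mirroring the answer's sample (5 rows, 5 value columns)
df = pd.DataFrame(
    {
        "VAR": ["SRR4216489", "SRR4216786", "SRR4216628", "SRR4216456", "SRR4216393"],
        "1_139632_G": [0.5, 0.5, 0.5, 0.5, 0.5],
        "1_158006_T": [0.5, 0.5, 0.0, 0.5, 0.5],
        "1_172595_A": [0.5, 0.5, 1.0, 0.5, 0.5],
        "1_564650_A": [0.0, 0.0, 0.0, 0.5, 0.5],
        "1_564652_G": [0.0, 0.5, 0.0, 0.5, 0.5],
    }
).set_index("VAR")

# True for columns that are 0.5 in every row, or in all rows but one
mostly_half = (df == 0.5).sum() >= len(df.index) - 1

# Invert the mask with ~ to drop those columns and keep the rest
df_dropped = df.loc[:, ~mostly_half]
print(df_dropped.columns.tolist())
```

Writing the condition as (df == 0.5).sum() < len(df.index)-1 would give the same selection without the ~.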
Sample:
#changed values in last 2 columns
print (df)
VAR 1_139632_G 1_158006_T 1_172595_A 1_564650_A 1_564652_G
0 SRR4216489 0.5 0.5 0.5 0.0 0.0
1 SRR4216786 0.5 0.5 0.5 0.0 0.5
2 SRR4216628 0.5 0.0 1.0 0.0 0.0
3 SRR4216456 0.5 0.5 0.5 0.5 0.5
4 SRR4216393 0.5 0.5 0.5 0.5 0.5
print (df[df == 0.5].count())
VAR 0
1_139632_G 5
1_158006_T 4
1_172595_A 4
1_564650_A 2
1_564652_G 3
dtype: int64
print ((df == 0.5).sum())
VAR 0
1_139632_G 5
1_158006_T 4
1_172595_A 4
1_564650_A 2
1_564652_G 3
dtype: int64
#if first column is not index set it
df = df.set_index('VAR')
print ((df == 0.5).sum() >= len(df.index)-1)
1_139632_G True
1_158006_T True
1_172595_A True
1_564650_A False
1_564652_G False
dtype: bool
print (df.loc[:, (df == 0.5).sum() >= len(df.index)-1])
1_139632_G 1_158006_T 1_172595_A
VAR
SRR4216489 0.5 0.5 0.5
SRR4216786 0.5 0.5 0.5
SRR4216628 0.5 0.0 1.0
SRR4216456 0.5 0.5 0.5
SRR4216393 0.5 0.5 0.5
Another solution without set_index; you only need to define the columns that must always be present in the output:
m = (df == 0.5).sum() >= len(df.index)-1
print (m)
VAR False
1_139632_G True
1_158006_T True
1_172595_A True
1_564650_A False
1_564652_G False
dtype: bool
need_cols = ['VAR']
m.loc[need_cols] = True
print (m)
VAR True
1_139632_G True
1_158006_T True
1_172595_A True
1_564650_A False
1_564652_G False
dtype: bool
print (df.loc[:, m])
VAR 1_139632_G 1_158006_T 1_172595_A
0 SRR4216489 0.5 0.5 0.5
1 SRR4216786 0.5 0.5 0.5
2 SRR4216628 0.5 0.0 1.0
3 SRR4216456 0.5 0.5 0.5
4 SRR4216393 0.5 0.5 0.5
A similar solution is to filter the column names separately and then select:
print (df[df.columns[m]])
VAR 1_139632_G 1_158006_T 1_172595_A
0 SRR4216489 0.5 0.5 0.5
1 SRR4216786 0.5 0.5 0.5
2 SRR4216628 0.5 0.0 1.0
3 SRR4216456 0.5 0.5 0.5
4 SRR4216393 0.5 0.5 0.5
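The attempt in the question can also be finished directly, since the NaN entries left by where already mark the columns to remove. This is only a sketch under the assumption that VAR is the index; the toy frame copies the sample above, and df_kept is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame mirroring the answer's sample, with VAR as the index
df = pd.DataFrame(
    {
        "VAR": ["SRR4216489", "SRR4216786", "SRR4216628", "SRR4216456", "SRR4216393"],
        "1_139632_G": [0.5, 0.5, 0.5, 0.5, 0.5],
        "1_158006_T": [0.5, 0.5, 0.0, 0.5, 0.5],
        "1_172595_A": [0.5, 0.5, 1.0, 0.5, 0.5],
        "1_564650_A": [0.0, 0.0, 0.0, 0.5, 0.5],
        "1_564652_G": [0.0, 0.5, 0.0, 0.5, 0.5],
    }
).set_index("VAR")

# The three lines from the question, unchanged in behaviour
total_samples = len(df.index)
df_col_05 = df[df == 0.5].count()
df_col_05 = df_col_05.where(df_col_05 < (total_samples - 1))

# Columns whose count survived the where() are not NaN; keep only those
df_kept = df.loc[:, df_col_05.notna()]
print(df_kept.columns.tolist())
```

The boolean Series from notna() is indexed by column name, so loc aligns it with the columns of df.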