Filtering a pandas DataFrame on partial string values across multiple columns

Date: 2018-07-04 11:28:09

Tags: python pandas filtering

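I recently started working with pandas and ran into a problem I cannot solve. It would be easier for me to do this in a plain Python script, but I would really like to do it in pandas. Here is my newbie question.

Given the following DataFrame: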

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID1 val str1,str2,str3@num  val str1,str2,str3@num  val str1,str2,str3@num
ID2 val str4,str5,str63@num val str4,str5,st63@num  val str4,str5,str63@num
ID3 val str1,str2,str3@num  val str1,str1,str3@num  val str4,str2,str3@num
ID4 val str1,str2,str3@num  val str2,str2,str3@num  val str1,str2,str3@num
ID5 val str4,str5,str63@num val str4,str5,st63@num  val str4,str5,str63@num
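
To reproduce the examples, the table above can be rebuilt as a DataFrame. This is only a sketch: the name df is an assumption (the question's own code calls the frame table_test), and "val" and "num" are kept as the placeholder strings shown in the table.

```python
import pandas as pd

# Values copied from the table above; "val" and "num" are placeholders.
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3", "ID4", "ID5"],
    "Sample1": ["val"] * 5,
    "quality1": ["str1,str2,str3@num", "str4,str5,str63@num", "str1,str2,str3@num",
                 "str1,str2,str3@num", "str4,str5,str63@num"],
    "Sample2": ["val"] * 5,
    "quality2": ["str1,str2,str3@num", "str4,str5,st63@num", "str1,str1,str3@num",
                 "str2,str2,str3@num", "str4,str5,st63@num"],
    "Sample3": ["val"] * 5,
    "quality3": ["str1,str2,str3@num", "str4,str5,str63@num", "str4,str2,str3@num",
                 "str1,str2,str3@num", "str4,str5,str63@num"],
})
print(df.shape)
```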

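I would like to write a function that keeps only the rows that reach a minimum required quality score in n columns. Only the first part of each string really matters, so the first step is to select only that first part: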

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID1 val str1    val str1    val str1
ID2 val str4    val str4    val str4
ID3 val str1    val str1    val str4
ID4 val str1    val str2    val str1
ID5 val str4    val str3    val str4

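Say I only want to keep the rows with a score of at least "str4" in two columns; alternatively, I might compute this as a percentage of the columns: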

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID2 val str4    val str4    val str4
ID5 val str4    val str3    val str4
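
The percentage idea could be sketched as below. This is an illustration under assumptions, not the asker's code: the three-row frame is a shortened, hypothetical copy of the data, the Sample columns are left out for brevity, and the 0.5 cut-off is an arbitrary choice.

```python
import pandas as pd

# Shortened, hypothetical copy of the question's data (Sample columns omitted).
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3"],
    "quality1": ["str1,str2,str3@num", "str4,str5,str63@num", "str1,str2,str3@num"],
    "quality2": ["str1,str2,str3@num", "str4,str5,st63@num", "str1,str1,str3@num"],
    "quality3": ["str1,str2,str3@num", "str4,str5,str63@num", "str4,str2,str3@num"],
})

quality = df.filter(like="quality")
# First comma-separated token of every cell.
first = quality.apply(lambda col: col.str.split(",").str[0])
# Share of quality columns whose first token equals the wanted score.
fraction = first.eq("str4").mean(axis=1)
kept = df[fraction >= 0.5]  # 0.5 is an assumed cut-off
print(kept["ID"].tolist())
```

Using mean(axis=1) instead of sum(axis=1) makes the threshold independent of how many quality columns the frame has.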

This is how I started. It only finds where the values are, and I still cannot put them back in place:

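min_val = "str4"  # threshold, not used yet
for i, row in enumerate(table_test.values):
    # the quality columns are every second column, starting at index 2
    scores = row[2::2].tolist()
    for score in scores:  # avoid naming the loop variable "list", which shadows the builtin
        first_str = score.split(",")[0]  # part before the first comma
        print(i, first_str)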

Thanks for any ideas and/or help!

1 Answer:

Answer 0: (score: 1)

Use boolean indexing to filter with a boolean mask:

min_val = "str4"
df = df[df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >= 2]
print (df)
    ID Sample1             quality1 Sample2            quality2 Sample3  \
1  ID2     val  str4,str5,str63@num     val  str4,str5,st63@num     val   
4  ID5     val  str4,str5,str63@num     val  str4,str5,st63@num     val   

              quality3  
1  str4,str5,str63@num  
4  str4,str5,str63@num  

Or, with applymap (deprecated since pandas 2.1 in favor of DataFrame.map):

min_val = "str4"
df = df[df.filter(like='quality').applymap(lambda x: x.startswith(min_val)).sum(axis=1) >= 2]
print (df)
    ID Sample1             quality1 Sample2            quality2 Sample3  \
1  ID2     val  str4,str5,str63@num     val  str4,str5,st63@num     val   
4  ID5     val  str4,str5,str63@num     val  str4,str5,st63@num     val   

              quality3  
1  str4,str5,str63@num  
4  str4,str5,str63@num  
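
Since the question asks for a function, the mask above could be wrapped as follows. This is a minimal sketch: the function name keep_rows_with_score, its parameters, and the small demo frame are all hypothetical and not part of the original answer.

```python
import pandas as pd

def keep_rows_with_score(df, min_val, min_count=2, like="quality"):
    """Keep rows where at least `min_count` columns whose name contains
    `like` hold a string starting with `min_val`."""
    mask = df.filter(like=like).apply(lambda col: col.str.startswith(min_val))
    return df[mask.sum(axis=1) >= min_count]

# Hypothetical, shortened data for illustration.
demo = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3"],
    "quality1": ["str1,str2", "str4,str5", "str1,str2"],
    "quality2": ["str1,str2", "str4,str5", "str1,str1"],
    "quality3": ["str1,str2", "str4,str5", "str4,str2"],
})
result = keep_rows_with_score(demo, "str4")
print(result["ID"].tolist())
```

The function returns a filtered copy and leaves the input frame untouched, so different thresholds can be tried on the same data.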

Explanation

First, filter all columns whose name contains the string quality:

print (df.filter(like='quality'))
              quality1            quality2             quality3
0   str1,str2,str3@num  str1,str2,str3@num   str1,str2,str3@num
1  str4,str5,str63@num  str4,str5,st63@num  str4,str5,str63@num
2   str1,str2,str3@num  str1,str1,str3@num   str4,str2,str3@num
3   str1,str2,str3@num  str2,str2,str3@num   str1,str2,str3@num
4  str4,str5,str63@num  str4,str5,st63@num  str4,str5,str63@num

Then compare every column with startswith to get a boolean DataFrame:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)))
   quality1  quality2  quality3
0     False     False     False
1      True      True      True
2     False     False      True
3     False     False     False
4      True      True      True
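
One point worth checking on real data: startswith is a prefix test, so it would also match a longer first token such as "str40". The sketch below contrasts it with an exact comparison of the first comma-separated token; the value "str40" is hypothetical and does not occur in the question's data.

```python
import pandas as pd

min_val = "str4"
# Hypothetical values; "str40" is not in the question's data.
col = pd.Series(["str4,str5,str63@num", "str40,str5,str63@num", "str1,str2,str3@num"])

prefix_match = col.str.startswith(min_val)           # True for "str4" and "str40"
token_match = col.str.split(",").str[0].eq(min_val)  # True only for the exact token "str4"
print(prefix_match.tolist())
print(token_match.tolist())
```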

Count the True values with sum; each True is treated as 1:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1))
0    0
1    3
2    1
3    0
4    3
dtype: int64

Compare against the threshold:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >=2)
0    False
1     True
2    False
3    False
4     True
dtype: bool

If you also want to split all quality columns at the first , with split and assign the result back:

min_val = "str4"
cols = df.filter(like='quality').columns
df[cols] = df[cols].apply(lambda x: x.str.split(',').str[0])
#another solution
#df[cols] = df[cols].applymap(lambda x: x.split(',')[0])
print (df)
    ID Sample1 quality1 Sample2 quality2 Sample3 quality3
0  ID1     val     str1     val     str1     val     str1
1  ID2     val     str4     val     str4     val     str4
2  ID3     val     str1     val     str1     val     str4
3  ID4     val     str1     val     str2     val     str1
4  ID5     val     str4     val     str4     val     str4

Then compare against min_val to get a boolean DataFrame, and filter in the same way as before.
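
The code for this last step is missing from the text. A minimal sketch of what it could look like is given below; it is a reconstruction under assumptions rather than the answer's original code. The frame is typed in by hand as it would look after the split step, with the Sample columns omitted for brevity, and the name filtered is hypothetical.

```python
import pandas as pd

min_val = "str4"
# Frame as it looks after keeping only the first token of each quality cell.
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3", "ID4", "ID5"],
    "quality1": ["str1", "str4", "str1", "str1", "str4"],
    "quality2": ["str1", "str4", "str1", "str2", "str4"],
    "quality3": ["str1", "str4", "str4", "str1", "str4"],
})

cols = df.filter(like="quality").columns
# Exact comparison gives a boolean DataFrame; count the True values per row.
filtered = df[(df[cols] == min_val).sum(axis=1) >= 2]
print(filtered["ID"].tolist())
```

Because the cells now hold only the first token, an exact equality test replaces startswith here.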