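I recently started working with pandas and ran into a problem I can't solve. It would be easier to do this with a plain Python script, but I'd really like to do it in pandas. Here is my newbie question.
Given the following DataFrame: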
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
ID1 val str1,str2,str3@num val str1,str2,str3@num val str1,str2,str3@num
ID2 val str4,str5,str63@num val str4,str5,st63@num val str4,str5,str63@num
ID3 val str1,str2,str3@num val str1,str1,str3@num val str4,str2,str3@num
ID4 val str1,str2,str3@num val str2,str2,str3@num val str1,str2,str3@num
ID5 val str4,str5,str63@num val str4,str5,st63@num val str4,str5,str63@num
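I would like to write a function that keeps only the rows that reach a minimum required quality score in n of the columns. Only the first part of each string really matters, so the first step is to select only that first part: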
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
ID1 val str1 val str1 val str1
ID2 val str4 val str4 val str4
ID3 val str1 val str1 val str4
ID4 val str1 val str2 val str1
ID5 val str4 val str3 val str4
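Say I want to keep only the rows whose score is at least "str4" in two columns (I might also want to express this as a percentage of the columns):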
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
ID2 val str4 val str4 val str4
ID5 val str4 val str3 val str4
This is how I started, just to find out where the values are, but I still can't work out how to put them back:
min_val = "str4"
for i, row in enumerate(table_test.values):
    # the quality columns sit at positions 2, 4, 6, ...
    scores = row[2::2]
    for score in scores.tolist():
        first_str = score.split(",")
        print(i, first_str[0])
Thanks for any ideas and/or help!
Answer 0 (score: 1)
Use boolean indexing to filter with a boolean mask:
min_val = "str4"
df = df[df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >= 2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 \
1 ID2 val str4,str5,str63@num val str4,str5,st63@num val
4 ID5 val str4,str5,str63@num val str4,str5,st63@num val
quality3
1 str4,str5,str63@num
4 str4,str5,str63@num
Or element-wise with applymap (deprecated since pandas 2.1 in favor of DataFrame.map):
min_val = "str4"
df = df[df.filter(like='quality').applymap(lambda x: x.startswith(min_val)).sum(axis=1) >= 2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 \
1 ID2 val str4,str5,str63@num val str4,str5,st63@num val
4 ID5 val str4,str5,str63@num val str4,str5,st63@num val
quality3
1 str4,str5,str63@num
4 str4,str5,str63@num
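To try this without the original file, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the helper names `q_low`, `q_high`, `mask` and `filtered` are only illustrative:

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question.
q_low = "str1,str2,str3@num"
q_high = "str4,str5,str63@num"
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3", "ID4", "ID5"],
    "Sample1": ["val"] * 5,
    "quality1": [q_low, q_high, q_low, q_low, q_high],
    "Sample2": ["val"] * 5,
    "quality2": [q_low, "str4,str5,st63@num", "str1,str1,str3@num",
                 "str2,str2,str3@num", "str4,str5,st63@num"],
    "Sample3": ["val"] * 5,
    "quality3": [q_low, q_high, "str4,str2,str3@num", q_low, q_high],
})

min_val = "str4"
# One boolean column per quality column: does the value start with min_val?
mask = df.filter(like="quality").apply(lambda col: col.str.startswith(min_val))
# Keep the rows where at least two quality columns match.
filtered = df[mask.sum(axis=1) >= 2]
print(filtered["ID"].tolist())  # ['ID2', 'ID5']
```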
Explanation:
First filter all columns whose name contains the string quality:
print (df.filter(like='quality'))
quality1 quality2 quality3
0 str1,str2,str3@num str1,str2,str3@num str1,str2,str3@num
1 str4,str5,str63@num str4,str5,st63@num str4,str5,str63@num
2 str1,str2,str3@num str1,str1,str3@num str4,str2,str3@num
3 str1,str2,str3@num str2,str2,str3@num str1,str2,str3@num
4 str4,str5,str63@num str4,str5,st63@num str4,str5,str63@num
Compare all columns with startswith to get a boolean DataFrame:
print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)))
quality1 quality2 quality3
0 False False False
1 True True True
2 False False True
3 False False False
4 True True True
Count the True values with sum; each True is treated as 1:
print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1))
0 0
1 3
2 1
3 0
4 3
dtype: int64
Compare against the threshold:
print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >=2)
0 False
1 True
2 False
3 False
4 True
dtype: bool
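Since the question asks for a function, the steps above could be wrapped roughly as follows. This is only a sketch: `keep_rows_by_quality`, its parameters and the small `demo` frame are hypothetical names and data, not part of pandas. The `min_frac` option covers the "percentage of the columns" case, relying on the fact that the row-wise mean of booleans is the fraction of True values:

```python
import pandas as pd

def keep_rows_by_quality(df, min_val, min_count=1, min_frac=None, like="quality"):
    """Keep rows whose quality columns start with min_val often enough.

    Hypothetical helper: min_count is a number of columns; if min_frac
    (a fraction between 0 and 1) is given, it is used instead.
    """
    mask = df.filter(like=like).apply(lambda col: col.str.startswith(min_val))
    if min_frac is not None:
        # Row-wise mean of booleans = fraction of matching columns.
        keep = mask.mean(axis=1) >= min_frac
    else:
        # Row-wise sum of booleans = number of matching columns.
        keep = mask.sum(axis=1) >= min_count
    return df[keep]

# Made-up demo data, only to exercise the helper.
demo = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "quality1": ["str4,x@1", "str1,x@1", "str4,x@1"],
    "quality2": ["str4,x@1", "str1,x@1", "str1,x@1"],
    "quality3": ["str4,x@1", "str4,x@1", "str1,x@1"],
})

by_count = keep_rows_by_quality(demo, "str4", min_count=2)
by_frac = keep_rows_by_quality(demo, "str4", min_frac=0.3)
print(by_count["ID"].tolist())  # ['a']
print(by_frac["ID"].tolist())   # ['a', 'b', 'c']
```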
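If you also want to keep only the part before the first comma, use split on all quality columns, select the first element of each resulting list, and assign the result back:
min_val = "str4"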
cols = df.filter(like='quality').columns
df[cols] = df[cols].apply(lambda x: x.str.split(',').str[0])
#another solution
#df[cols] = df[cols].applymap(lambda x: x.split(',')[0])
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
0 ID1 val str1 val str1 val str1
1 ID2 val str4 val str4 val str4
2 ID3 val str1 val str1 val str4
3 ID4 val str1 val str2 val str1
4 ID5 val str4 val str4 val str4
Then compare to get a boolean DataFrame and filter the same way as before:
min_val
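One possible way to write that last step, as a sketch rather than the original answer's code: once the columns hold only the first part of each string, an exact `==` comparison can replace `startswith`. The sample rows below are made up, and the "str40" row is there only to show that exact comparison no longer treats it as a match for "str4", which a prefix test would:

```python
import pandas as pd

# Made-up sample; the "str40" row is hypothetical and only illustrates
# the difference between a prefix test and an exact comparison.
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3"],
    "quality1": ["str1,str2,str3@num", "str4,str5,str63@num", "str40,str2,str3@num"],
    "quality2": ["str1,str2,str3@num", "str4,str5,st63@num", "str40,str1,str3@num"],
})

min_val = "str4"
cols = df.filter(like="quality").columns
# Keep only the part before the first comma in every quality column.
df[cols] = df[cols].apply(lambda col: col.str.split(",").str[0])
# Exact comparison, then count the matches per row and apply the threshold.
result = df[(df[cols] == min_val).sum(axis=1) >= 2]
print(result["ID"].tolist())  # ['ID2']
```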