Question

I have the following data frame:

            ProbeGenes  sample1  sample2  sample3
0      1431492_at Lipn     20.3      130        1
1   1448678_at Fam118a     25.3      150        2
2  1452580_a_at Mrpl21      3.1      173       12

It was created with this code:

import pandas as pd
df = pd.DataFrame({'ProbeGenes' : ['1431492_at Lipn', '1448678_at Fam118a','1452580_a_at Mrpl21'],
                   'sample1' : [20.3, 25.3,3.1],
                   'sample2' : [130, 150,173],        
                   'sample3' : [1.0, 2.0,12.0],         
                   })

What I want to do is, given a list:

list_to_grep = ["Mrpl21","lipn","XXX"]

I want to extract (grep) the subset of df whose ProbeGenes column contains any member of list_to_grep, yielding:

         ProbeGenes  sample1  sample2  sample3
    1431492_at Lipn     20.3      130        1
1452580_a_at Mrpl21      3.1      173       12

Ideally, the grepping should be case-insensitive. How can I achieve that?

Answer 1

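Your example does not really need regular expressions; a plain substring test is enough.

Define a function that returns whether a given string contains any element of the list.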

list_to_grep = ['Mrpl21', 'lipn', 'XXX']
def _grep(x, list_to_grep):
    """takes a string (x) and checks whether any string from a given 
       list of strings (list_to_grep) exists in `x`"""
    for text in list_to_grep:
        if text.lower() in x.lower():
            return True
    return False

Create a mask:

mask = df.ProbeGenes.apply(_grep, list_to_grep=list_to_grep)

Use this mask to filter the data frame:

df[mask]

Output:

            ProbeGenes  sample1  sample2  sample3
0      1431492_at Lipn     20.3      130        1
2  1452580_a_at Mrpl21      3.1      173       12
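
As an alternative sketch (not part of the approach above), pandas' built-in `Series.str.contains` can perform the same case-insensitive match in a single call. This assumes the list entries should be matched literally, so each one is passed through `re.escape` before being joined with `|`; the names `pattern` and `result` are only illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "ProbeGenes": ["1431492_at Lipn", "1448678_at Fam118a", "1452580_a_at Mrpl21"],
    "sample1": [20.3, 25.3, 3.1],
    "sample2": [130, 150, 173],
    "sample3": [1.0, 2.0, 12.0],
})
list_to_grep = ["Mrpl21", "lipn", "XXX"]

# Escape each entry so it is treated as literal text, then OR them together.
pattern = "|".join(re.escape(s) for s in list_to_grep)

# case=False makes the match case-insensitive ("lipn" matches "Lipn").
mask = df["ProbeGenes"].str.contains(pattern, case=False, regex=True)
result = df[mask]
print(result)
```

If the column may contain missing values, `str.contains` also accepts an `na` argument to decide how those rows are treated.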

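Note that this works well for small datasets. On a large data frame (~10 GB), however, I have seen unreasonably long run times when applying the function to a text column, whereas applying the same function to a plain list took much less time. I don't know why.

For reasons beyond me, something like this lets me filter much faster: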

>>> from functools import partial
>>> mylist = df.ProbeGenes.tolist()
>>> _greppy = partial(_grep, list_to_grep=list_to_grep)
>>> mymask = list(map(_greppy, mylist))
>>> df[mymask]
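
Whether the list-based version is actually faster will depend on the data and the pandas version, so it may be worth measuring. The following is a minimal sketch of such a check, assuming a synthetic frame built by repeating the three example rows (the repeat count, `mask_apply`, `mask_map`, `t_apply` and `t_map` are made up for illustration). It first confirms that both approaches produce the same mask, then times them with `timeit`.

```python
import timeit
from functools import partial

import pandas as pd


def _grep(x, list_to_grep):
    """Return True if any string in list_to_grep occurs in x (case-insensitive)."""
    for text in list_to_grep:
        if text.lower() in x.lower():
            return True
    return False


# Synthetic data: the three example rows repeated (hypothetical size).
rows = ["1431492_at Lipn", "1448678_at Fam118a", "1452580_a_at Mrpl21"]
df = pd.DataFrame({"ProbeGenes": rows * 1000})
list_to_grep = ["Mrpl21", "lipn", "XXX"]
_greppy = partial(_grep, list_to_grep=list_to_grep)

# Both approaches should yield identical boolean masks.
mask_apply = df.ProbeGenes.apply(_grep, list_to_grep=list_to_grep)
mask_map = list(map(_greppy, df.ProbeGenes.tolist()))
assert mask_apply.tolist() == mask_map

# Time each approach; the numbers vary by machine and pandas version.
t_apply = timeit.timeit(
    lambda: df.ProbeGenes.apply(_grep, list_to_grep=list_to_grep), number=5)
t_map = timeit.timeit(
    lambda: list(map(_greppy, df.ProbeGenes.tolist())), number=5)
print(f"Series.apply: {t_apply:.4f}s  list/map: {t_map:.4f}s")
```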
