How to filter a dataframe using partial matches from another dataframe

Date: 2015-07-29 00:06:42

Tags: python csv pandas filtering dataframe

I have two dataframes and I want to use one of them to filter the other and make a new dataframe. The two dataframes have a column with similar information, but it is not an exact match. I have been trying to use str.contains, but so far I keep getting this error when I try:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Here is a sample of my dataframes and the code I have tried. The heads of the two dataframes don't have a match, but the two columns named 'AssociatedGeneName' are the ones that would be compared.

promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())

AssociatedGeneName            B      C    D E                                   F
            plexB_1  NC_004353.3  64381  - Drosophila melanogaster (Fruit fly)  region 
               ci_1  NC_004353.3  76925  - Drosophila melanogaster (Fruit fly)  region   
             RS3A_1  NC_004353.3  87829  - Drosophila melanogaster (Fruit fly)  region   
              pan_1  NC_004353.3  89986  + Drosophila melanogaster (Fruit fly)  region  
              pan_2  NC_004353.3  90281  + Drosophila melanogaster (Fruit fly)  region   

data = pd.read_csv('FBgn with gene name.csv')
print(data.head())

   Gene AssociatedGeneName  FBgn Number  timepoint
CG10002                fkh  FBgn0000659          2
CG10002                fkh  FBgn0000659          2
CG10002                fkh  FBgn0000659          2
CG10002                fkh  FBgn0000659          2
CG10006            CG10006  FBgn0036461          2

x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]

The ideal outcome would be something similar to the following:

AssociatedGeneName            B         C D                                   E       F
             fkh_1  NT_033777.2  24410805 - Drosophila melanogaster (Fruit fly)  region

Essentially I want a dataframe with all of the values in promoter that have a partial match to the values in data['AssociatedGeneName']. If someone could point me in the right direction I would be grateful. I am relatively new to coding; I have been using Python and pandas and would prefer to keep using Python to solve this problem. The error I keep getting is the TypeError quoted above.

2 answers:

Answer 0 (score: 0)

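str.contains takes a single string as its argument and checks whether that string is contained in each promoter.AssociatedGeneName entry, returning True or False for each index (row).

However, when you pass data.AssociatedGeneName to str.contains, you are passing a pandas.Series rather than a string, which is why you get the error.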
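To make the single-string behaviour concrete, here is a minimal sketch. The small frame is a made-up stand-in that reuses a few names from the question's sample, and "pan" is just an example substring.

```python
import pandas as pd

# Made-up stand-in for the promoter dataframe from the question.
promoter = pd.DataFrame({"AssociatedGeneName": ["plexB_1", "ci_1", "pan_1", "pan_2"]})

# With one string, str.contains returns a boolean Series aligned with promoter.
# regex=False treats "pan" as a literal substring rather than a pattern.
mask = promoter["AssociatedGeneName"].str.contains("pan", regex=False)
print(mask.tolist())  # [False, False, True, True]

# The boolean Series can be used directly as a row filter.
print(promoter[mask])
```

The mask has one entry per row of promoter, so filtering against many names comes down to combining one such mask per name.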

If you just want the rows of promoter that are partial matches, you can do:

import numpy as np  # np.where returns the positions of the True entries
where_inds_par = [np.where(promoter.AssociatedGeneName.str.contains(partial))[0] for partial in data.AssociatedGeneName]

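Now each element of where_inds_par is itself an array of indices with length >= 0. Also, because your data.AssociatedGeneName column contains repeated values, there will be some redundancy, but you can filter that out using set and a list comprehension: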

inds_par = sorted(set(i for sublist in where_inds_par for i in sublist))  # set keeps only the unique positions
promoter_par = promoter.iloc[inds_par]  # .ix was removed from pandas; iloc selects rows by position
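As an alternative to looping over every name, the same filter can be done in one str.contains call by joining the unique names into a single regular expression. This is a sketch of a different technique, not the method used in this answer; the two small frames are made-up stand-ins for the question's CSV files.

```python
import re
import pandas as pd

# Made-up stand-ins for the question's two dataframes.
promoter = pd.DataFrame(
    {"AssociatedGeneName": ["plexB_1", "ci_1", "fkh_1", "pan_1", "pan_2"]}
)
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "pan", "CG10006"]})

# Escape each name so characters such as "(" or "." are matched literally,
# then join the unique names into one alternation pattern.
names = data["AssociatedGeneName"].dropna().unique()
pattern = "|".join(re.escape(name) for name in names)

# One pass over promoter: True where any name occurs inside the promoter name.
mask = promoter["AssociatedGeneName"].str.contains(pattern, regex=True, na=False)
promoter_par = promoter[mask]
print(promoter_par)
```

Note that this is still plain substring matching, so a very short gene name will also match any longer promoter name that happens to contain it.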

Answer 1 (score: 0)

First create a function that checks whether a value from promoter partially matches a value from data. It checks every value in data:
def contain_partial(x, y=data.AssociatedGeneName):
    # For one promoter name x, test each name z in y as a substring of x.
    res = []
    for z in y:
        res.append(z in x)
    return res

Applying the function to each promoter name gives a list of True/False values per row:

contains = promoter.AssociatedGeneName.apply(contain_partial)

Finally, check whether at least one value in each list is True, and use the result to filter promoter:

promoter[contains.apply(any)]
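For reference, the steps above can be collapsed into one helper that short-circuits with any() instead of building a full list for every row. This is only a sketch: the two small frames are made-up stand-ins for the question's CSV files, and contains_any is a hypothetical name, not part of pandas.

```python
import pandas as pd

# Made-up stand-ins for promoter_coordinate.csv and 'FBgn with gene name.csv'.
promoter = pd.DataFrame(
    {"AssociatedGeneName": ["plexB_1", "fkh_1", "RS3A_1", "pan_2"]}
)
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "CG10006", "pan"]})

# De-duplicate once so repeated gene names are not re-tested for every row.
gene_names = data["AssociatedGeneName"].dropna().unique()


def contains_any(promoter_name, names=gene_names):
    """Hypothetical helper: True if any gene name is a substring of promoter_name."""
    return any(name in promoter_name for name in names)


# One boolean per promoter row, used directly as a filter.
mask = promoter["AssociatedGeneName"].apply(contains_any)
filtered = promoter[mask]
print(filtered)
```

Because any() stops at the first match, rows that match an early name avoid scanning the rest of the list.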