I have two dataframes and I want to use one of them to filter the other and make a new dataframe. The two dataframes have a column with similar information, but it is not an exact match. I have been trying to use str.contains, but so far I keep getting TypeError: 'Series' objects are mutable, thus they cannot be hashed when I try. Here is a sample of my dataframes and the code I have tried.
The heads of the two dataframes do not share a match, but the ideal outcome would compare the two columns named 'AssociatedGeneName' and look like the example row shown after the samples.
import pandas as pd

promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())
AssociatedGeneName  B            C      D  E                                    F
plexB_1             NC_004353.3  64381  -  Drosophila melanogaster (Fruit fly)  region
ci_1                NC_004353.3  76925  -  Drosophila melanogaster (Fruit fly)  region
RS3A_1              NC_004353.3  87829  -  Drosophila melanogaster (Fruit fly)  region
pan_1               NC_004353.3  89986  +  Drosophila melanogaster (Fruit fly)  region
pan_2               NC_004353.3  90281  +  Drosophila melanogaster (Fruit fly)  region
data = pd.read_csv('FBgn with gene name.csv')
print(data.head())
Gene     AssociatedGeneName  FBgn Number  timepoint
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10006  CG10006             FBgn0036461  2
x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]
Essentially I want a dataframe with all of the rows in promoter that have a partial match to the values in data['AssociatedGeneName'], for example:
AssociatedGeneName  B            C         D  E                                    F
fkh_1               NT_033777.2  24410805  -  Drosophila melanogaster (Fruit fly)  region
If someone could point me in the right direction I would be grateful. I am relatively new to coding; I have been using Python and pandas and would prefer to keep using Python to solve this problem. Here is the error I keep getting.
TypeError: 'Series' objects are mutable, thus they cannot be hashed
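Answer 0 (score: 0)
str.contains accepts a single string as its argument and checks whether that string is contained in each promoter.AssociatedGeneName entry, returning True or False for each index (row). However, when you pass data.AssociatedGeneName to str.contains, you are passing a pandas.Series, which is why you get the error.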
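To make the single-string behaviour concrete, here is a minimal sketch on a toy Series. The three names are illustrative stand-ins, not the asker's full data.

```python
import pandas as pd

# Toy series; the names are illustrative stand-ins for promoter.AssociatedGeneName.
genes = pd.Series(["plexB_1", "fkh_1", "fkh_2"])

# A single string pattern yields one True/False value per row.
mask = genes.str.contains("fkh")
print(mask.tolist())  # [False, True, True]
```

The boolean result can be used directly as a row filter, which is why the pattern has to be one string rather than a whole column.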
If you only want the rows of promoter that partially match, you can do
from numpy import where
where_inds_par = [where(promoter.AssociatedGeneName.str.contains(partial))[0] for partial in data.AssociatedGeneName]
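Each element of where_inds_par is itself an array of indices of length >= 0. Also, because your data.AssociatedGeneName column contains repeated values, there will be some redundancy, but you can filter that out with set and a nested list comprehension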
inds_par = sorted(set(i for sublist in where_inds_par for i in sublist))  # set keeps the unique positions
promoter_par = promoter.iloc[inds_par]  # .ix was removed from pandas; these are positional indices
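A possible alternative to looping over every name is to join the unique names into one regular expression and call str.contains once. The sketch below assumes toy dataframes; the row CG10006_2 and the values in column C for it are made up for illustration.

```python
import re
import pandas as pd

# Toy stand-ins for the two dataframes (some values are invented for illustration).
promoter = pd.DataFrame({
    "AssociatedGeneName": ["plexB_1", "ci_1", "fkh_1", "pan_1", "CG10006_2"],
    "C": [64381, 76925, 24410805, 89986, 12345],
})
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "fkh", "CG10006"]})

# Deduplicate the names and escape any regex metacharacters they may contain.
names = data["AssociatedGeneName"].dropna().unique()
pattern = "|".join(re.escape(name) for name in names)

# One pass over promoter: True where any of the names occurs as a substring.
mask = promoter["AssociatedGeneName"].str.contains(pattern, regex=True)
promoter_par = promoter[mask]
print(promoter_par)
```

Because this is plain substring matching, a short name can also match inside a longer, unrelated name, so the result may need a manual check.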
Answer 1 (score: 0)
First create a function that checks whether a value from promoter contains a partial match from data; it checks against every value in data
def contain_partial(x, y=data.AssociatedGeneName):
    # For one promoter name x, test each name z from data as a substring
    res = []
    for z in y:
        res.append(z in x)
    return res
Applying it to the column gives the result of the function
contains = promoter.AssociatedGeneName.apply(contain_partial)
Then, finally, check whether at least one value is True for each row, and use that to filter promoter
promoter[contains.apply(any)]
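If the promoter names always follow the form gene name plus a trailing _number (an assumption based only on the sample rows, not something the question confirms), an exact comparison may be safer than substring matching. A rough sketch on toy data:

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values only).
promoter = pd.DataFrame({"AssociatedGeneName": ["plexB_1", "fkh_1", "fkh_2", "pan_1"]})
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "CG10006"]})

# Assumption: each promoter name is "<gene>_<number>", so drop the last "_..." part.
base = promoter["AssociatedGeneName"].str.rsplit("_", n=1).str[0]

# Exact membership test against the gene names in data.
exact = promoter[base.isin(data["AssociatedGeneName"])]
print(exact)
```

This avoids a short name matching inside a longer one, but it silently misses rows whose names do not follow the assumed suffix pattern.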