Question

I have a pandas DataFrame:

        _id     _score      ensembl   ensembl.gene  notfound
query                   
Dnmt3a  1788    89.405594   NaN      ENSG00000119772     NaN
SUMO1   7341    85.157100   NaN      ENSG00000116030    NaN
GADD45a 1647    86.867760   NaN      ENSG00000116717    NaN
Rad17   5884    85.377050   [{u'gene': u'ENSG00000155093'}, {u'gene': u'ENSG00000282185'}]  NaN NaN
DRS     NaN     NaN         NaN       NaN               True

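How can I derive the Ensembl ID for each row from the values of 'ensembl', 'ensembl.gene' and 'notfound'? The output should follow three rules:

If both 'ensembl' and 'ensembl.gene' are NaN, the output is "Not_found", as in the fifth row.
If 'ensembl' is NaN, just print the value of 'ensembl.gene', as in the first, second and third rows.
If 'ensembl.gene' is NaN, print the first part of the 'ensembl' value. In the fourth row, for example, 'ensembl.gene' is NaN, so the output is the first part of 'ensembl', i.e. ENSG00000155093.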

The output should be:

    Ensemble_ID
query                   
Dnmt3a  ENSG00000119772
SUMO1   ENSG00000116030
GADD45a ENSG00000116717
Rad17   ENSG00000155093
DRS     Not_found

Answer 1

If I understand correctly, this is what you need:

import numpy as np
import pandas as pd

def make_id(row):
    if row['ensembl'] is np.nan and row['ensembl.gene'] is np.nan:  # 1) If both 'ensembl' and 'ensembl.gene' are NaN, the output is "Not_found".
        return 'Not_found'
    elif row['ensembl'] is np.nan:                                  # 2) If 'ensembl' is NaN, return the value of 'ensembl.gene'
        return row['ensembl.gene']
    else:                                                           # 3) (otherwise) 'ensembl' is populated, so return the first gene it lists
        return row['ensembl'][0]['gene']


# Small demo frame: only the middle row has a populated 'ensembl' cell
df = pd.DataFrame({'ensembl': [np.nan,[{u'gene': u'ENSG00000155093'}],np.nan], 'ensembl.gene':[1,4,5]})
df['id'] = df.apply(make_id, axis=1)
print(df)

                         ensembl  ensembl.gene               id
0                           None             1                1
1  [{'gene': 'ENSG00000155093'}]             4  ENSG00000155093
2                           None             5                5

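This generates an ID for every row of df and stores it in the corresponding 'id' column.

Note: if missing values in your data are not represented by np.nan, replace np.nan inside the function with whatever placeholder your data uses for "nan".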
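As a hedged variant of the function above: an identity test such as `is np.nan` only matches the np.nan object itself, so it can miss NaN values that pandas has re-boxed as ordinary floats. The sketch below swaps in `pd.isna` and runs on a frame rebuilt from the question. The helper name `is_missing` and the hand-built frame are my own assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

def is_missing(value):
    """True for NaN/None scalars; a list (a populated 'ensembl' cell) never counts as missing."""
    return not isinstance(value, list) and pd.isna(value)

def make_id(row):
    ensembl, gene = row["ensembl"], row["ensembl.gene"]
    if is_missing(ensembl) and is_missing(gene):   # rule 1: nothing available
        return "Not_found"
    if is_missing(ensembl):                        # rule 2: fall back to 'ensembl.gene'
        return gene
    return ensembl[0]["gene"]                      # rule 3: first gene listed in 'ensembl'

# Frame rebuilt by hand from the question (assumed to hold real lists, not strings)
df = pd.DataFrame(
    {
        "ensembl": [np.nan, np.nan, np.nan,
                    [{"gene": "ENSG00000155093"}, {"gene": "ENSG00000282185"}],
                    np.nan],
        "ensembl.gene": ["ENSG00000119772", "ENSG00000116030",
                         "ENSG00000116717", np.nan, np.nan],
    },
    index=pd.Index(["Dnmt3a", "SUMO1", "GADD45a", "Rad17", "DRS"], name="query"),
)
df["Ensemble_ID"] = df.apply(make_id, axis=1)
print(df["Ensemble_ID"])
```

On this frame the resulting column matches the expected output listed in the question.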

Answer 2

If I understood your question correctly, this code solves your problem:

searched_id = df.loc[df['ensembl.gene']=='ENSG00000119772'].index[0]

You can generalize the code in the following simple way:

def get_index(df, pred):
    return df.loc[pred].index

This way the result is filtered by the predicate, and the corresponding list of index labels is returned. A usage example:

pred = (df['ensembl']=='val1') & (df['ensembl.gene']=='val2') & (df['notfound']=='val3')
searched_id = get_index(df, pred)
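To make the placeholder predicate concrete, here is a minimal sketch on a small frame rebuilt from the question. The frame contents and the two example predicates are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def get_index(df, pred):
    # Return the index labels of the rows where the boolean mask is True
    return df.loc[pred].index

# Reduced version of the question's frame (assumed values)
df = pd.DataFrame(
    {
        "ensembl.gene": ["ENSG00000119772", "ENSG00000116030", np.nan],
        "notfound": [np.nan, np.nan, True],
    },
    index=pd.Index(["Dnmt3a", "SUMO1", "DRS"], name="query"),
)

# Rows whose 'ensembl.gene' equals a known ID
found = get_index(df, df["ensembl.gene"] == "ENSG00000119772")

# Rows with no gene ID at all
missing = get_index(df, df["ensembl.gene"].isna())

print(list(found), list(missing))
```

Note that this looks rows up by known values; it does not build the combined ID column the question asks for.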

If this does not answer your question, please try rephrasing it, because at the moment it is not clear.

Answer 3

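First create a copy of the "ensembl.gene" column. Then apply the "where" method together with a regular expression (this assumes the 'ensembl' cells are stored as strings). Finally use "fillna".

df["Ensemble_ID"]=df["ensembl.gene"]
# Keep 'ensembl.gene' where 'ensembl' is NaN; otherwise take the first ENSG ID found in the 'ensembl' string
df["Ensemble_ID"]=df["Ensemble_ID"].where(df["ensembl"].isna(),df["ensembl"].str.extract(r"u'(ENSG\d+)",expand=False))
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df["Ensemble_ID"]=df["Ensemble_ID"].fillna("Not_found")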

df["Ensemble_ID"]                                                                                                   
query
Dnmt3a     ENSG00000119772
SUMO1      ENSG00000116030
GADD45a    ENSG00000116717
Rad17      ENSG00000155093
DRS              Not_found
Name: Ensemble_ID, dtype: object
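The regular expression above only works when the 'ensembl' cells are strings. If they hold real Python lists of dicts instead, a sketch along the following lines might do the same job. The helper `first_gene` and the hand-built frame are assumptions, and unlike the `where` version this one prefers 'ensembl.gene' whenever both columns are populated.

```python
import numpy as np
import pandas as pd

# Frame rebuilt by hand from the question, with 'ensembl' holding real lists
df = pd.DataFrame(
    {
        "ensembl": [np.nan, np.nan, np.nan,
                    [{"gene": "ENSG00000155093"}, {"gene": "ENSG00000282185"}],
                    np.nan],
        "ensembl.gene": ["ENSG00000119772", "ENSG00000116030",
                         "ENSG00000116717", np.nan, np.nan],
    },
    index=pd.Index(["Dnmt3a", "SUMO1", "GADD45a", "Rad17", "DRS"], name="query"),
)

def first_gene(cell):
    # Take the first listed gene, or NaN when the cell is not a non-empty list
    return cell[0]["gene"] if isinstance(cell, list) and cell else np.nan

from_lists = df["ensembl"].map(first_gene)

# Use 'ensembl.gene' first, then the value pulled from the list, then the placeholder
df["Ensemble_ID"] = df["ensembl.gene"].fillna(from_lists).fillna("Not_found")
print(df["Ensemble_ID"])
```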

Answer 4

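As I understand it, you want to find the value of "_id" that corresponds to known values of 'ensembl', 'ensembl.gene' and 'notfound'. Here is how, using a toy dataframe (easily extended to your case).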

import numpy as np
import pandas as pd
df = pd.DataFrame({'id':[0,1,2,3],
                   'col_1':[11,12,13,14],
                   'col_2':[110,120,130,140],
                   'col_3':[1100,1200,1300,1400]})
# Combine the three equality tests into a single boolean mask
condition = np.logical_and(
    np.logical_and(df['col_1']==13,
                   df['col_2']==130),df['col_3']==1300)
print('the index(es) corresponding to the values of the columns is: '
      f'{df["id"][condition].values}')
# output
the index(es) corresponding to the values of the columns is: [2]
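The same lookup can also be written with pandas' own `&` operator and `.loc`, which some find easier to read than nested `np.logical_and` calls. This is only an equivalent sketch on the same toy frame; the variable names `mask` and `matching_ids` are mine.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3],
                   'col_1': [11, 12, 13, 14],
                   'col_2': [110, 120, 130, 140],
                   'col_3': [1100, 1200, 1300, 1400]})

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (df['col_1'] == 13) & (df['col_2'] == 130) & (df['col_3'] == 1300)

# Select the 'id' values of the rows where all three conditions hold
matching_ids = df.loc[mask, 'id'].tolist()
print(matching_ids)  # [2]
```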

How to select a particular column in a dataframe between two columns, based on column values?

4 answers: