Question

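I have two datasets. One has 16169 rows and 5 columns, and I want to replace one of its columns with the corresponding names. Those names come from the other dataset.

For example: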

UniProtID    NAME
Q15173     PPP2R5B
P30154     PPP2R1B
P63151     PPP2R2A

DrugBankID  Name       Type         UniProtID  UniProt Name
DB00001     Lepirudin  BiotechDrug  P00734     Prothrombin
DB00002     Cetuximab  BiotechDrug  P00533     Epidermal growth factor receptor
DB00002     Cetuximab  BiotechDrug  O75015     Low affinity immunoglobulin gamma Fc region receptor III-B

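In this example, I want to replace all the UniProt IDs with the corresponding names from the first dataset above. What is the best way to do this?

I am new to programming and Python, so any advice or help is appreciated.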

Answer 1

I think you need map with a Series created by set_index. If some values do not match, NaN is produced:

#change data for match
print (df1)
  UniProtID     NAME
0    O75015  PPP2R5B
1    P00734  PPP2R1B
2    P63151  PPP2R2A

df2['UniProt Name'] = df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
print (df2)
  DrugBankID       Name         Type UniProtID UniProt Name
0    DB00001  Lepirudin  BiotechDrug    P00734      PPP2R1B
1    DB00002  Cetuximab  BiotechDrug    P00533          NaN
2    DB00002  Cetuximab  BiotechDrug    O75015      PPP2R5B

If you need the original values instead of NaN:

df2['UniProt Name'] = (df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
                                       .fillna(df2['UniProt Name']))
print (df2)
  DrugBankID       Name         Type UniProtID  \
0    DB00001  Lepirudin  BiotechDrug    P00734
1    DB00002  Cetuximab  BiotechDrug    P00533
2    DB00002  Cetuximab  BiotechDrug    O75015

                       UniProt Name
0                           PPP2R1B
1  Epidermal growth factor receptor
2                           PPP2R5B

Solution with merge: a left join is needed, followed by fillna or combine_first, and finally drop to remove the column:

df = pd.merge(df2, df1, on="UniProtID", how='left')
df['UniProt Name'] = df['NAME'].fillna(df['UniProt Name'])
#alternative
#df['UniProt Name'] = df['NAME'].combine_first(df['UniProt Name'])
df.drop('NAME', axis=1, inplace=True)
print (df)
  DrugBankID       Name         Type UniProtID  \
0    DB00001  Lepirudin  BiotechDrug    P00734
1    DB00002  Cetuximab  BiotechDrug    P00533
2    DB00002  Cetuximab  BiotechDrug    O75015

                       UniProt Name
0                           PPP2R1B
1  Epidermal growth factor receptor
2                           PPP2R5B

Alternatively, drop the old column and rename NAME (unmatched IDs become NaN):

df = pd.merge(df2, df1, on="UniProtID", how='left')
df = df.drop('UniProt Name', axis=1).rename(columns={'NAME':'UniProt Name'})
print (df)
  DrugBankID       Name         Type UniProtID UniProt Name
0    DB00001  Lepirudin  BiotechDrug    P00734      PPP2R1B
1    DB00002  Cetuximab  BiotechDrug    P00533          NaN
2    DB00002  Cetuximab  BiotechDrug    O75015      PPP2R5B
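For anyone who wants to try the map approach end to end, here is a minimal self-contained sketch. It rebuilds the two sample frames (with the lookup IDs changed so that some rows match, as in the answer's sample data) and wraps the map plus fillna step in a helper. The function name replace_names and the drop_duplicates guard are my own additions for illustration, not part of the original answer; the guard assumes that keeping the first name per ID is acceptable.

```python
import pandas as pd

# Lookup table: UniProt ID -> gene name (IDs altered so that some rows match).
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})

# Table whose 'UniProt Name' column should be replaced.
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "Name": ["Lepirudin", "Cetuximab", "Cetuximab"],
    "Type": ["BiotechDrug"] * 3,
    "UniProtID": ["P00734", "P00533", "O75015"],
    "UniProt Name": [
        "Prothrombin",
        "Epidermal growth factor receptor",
        "Low affinity immunoglobulin gamma Fc region receptor III-B",
    ],
})


def replace_names(drugs, lookup):
    """Hypothetical helper: return a copy of `drugs` with 'UniProt Name'
    replaced wherever the UniProtID is present in `lookup`."""
    # Keep one name per ID so the mapping Series has a unique index.
    id_to_name = (lookup.drop_duplicates("UniProtID")
                        .set_index("UniProtID")["NAME"])
    out = drugs.copy()
    # Unmatched IDs map to NaN; fillna restores the original name for those.
    out["UniProt Name"] = (out["UniProtID"].map(id_to_name)
                                           .fillna(out["UniProt Name"]))
    return out


result = replace_names(df2, df1)
print(result)
```

Working on a copy keeps df2 unchanged, which makes it easy to compare before and after while experimenting.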

Answer 2

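The general way to solve this is to perform a SQL-like join on the two tables.

Note: this may be expensive for larger datasets; I have not tested its performance.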

import pandas as pd

left = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                     "Name": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})

right = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                      "UniProt Name": ["Prothrombin", "Epidermal growth factor receptor", "Low affinity immunoglobulin gamma Fc region receptor III-B"],
                      "Type": ["BiotechDrug", "BiotechDrug", "BiotechDrug"],
                      "DrugBankID": ["DB00001", "DB00002", "DB00003"]})

result = pd.merge(left, right, on="UniProtID")

Reference: https://pandas.pydata.org/pandas-docs/stable/merging.html#overlapping-value-columns
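As a rough sketch of how the join could be checked on the question's own sample rows, the snippet below does a left merge so that every drug row is kept, and uses the indicator and validate options of merge to report IDs with no match and to fail early if the lookup table has repeated IDs. The variable names lookup and drugs are placeholders of mine, and the sample values are copied from the question; in that sample none of the IDs overlap, so every row is reported as unmatched.

```python
import pandas as pd

# Lookup table from the question: UniProt ID -> name.
lookup = pd.DataFrame({
    "UniProtID": ["Q15173", "P30154", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})

# A slice of the larger table from the question.
drugs = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "UniProtID": ["P00734", "P00533", "O75015"],
})

merged = drugs.merge(
    lookup,
    on="UniProtID",
    how="left",              # keep every row of `drugs`
    validate="many_to_one",  # raise if `lookup` has duplicate UniProtIDs
    indicator=True,          # add a `_merge` column: 'both' or 'left_only'
)

# IDs that had no entry in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "UniProtID"].tolist()
print(len(merged), unmatched)
```

A left join is used here rather than the default inner join because an inner join would silently drop rows whose ID is missing from the lookup table.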

What is the best way to replace thousands of rows of IDs with their corresponding names in Python?
