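I have two datasets. One has 16169 rows by 5 columns, and I want to replace one of its columns with the corresponding names. Those names come from the other dataset.
For example: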
UniProtID  NAME
Q15173     PPP2R5B
P30154     PPP2R1B
P63151     PPP2R2A

DrugBankID  Name       Type         UniProtID  UniProt Name
DB00001     Lepirudin  BiotechDrug  P00734     Prothrombin
DB00002     Cetuximab  BiotechDrug  P00533     Epidermal growth factor receptor
DB00002     Cetuximab  BiotechDrug  O75015     Low affinity immunoglobulin gamma Fc region receptor III-B
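In this example, I want to replace all the UniProt IDs with the corresponding names from the upper dataset. What is the best way to do this?
I am new to programming and Python, so any advice or help is appreciated.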
Answer 0 (score: 3)
I think you need map with a Series created by set_index. If some values do not match, you get NaN:
# df1 changed so that some IDs match
print (df1)
UniProtID NAME
0 O75015 PPP2R5B
1 P00734 PPP2R1B
2 P63151 PPP2R2A
df2['UniProt Name'] = df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
print (df2)
DrugBankID Name Type UniProtID UniProt Name
0 DB00001 Lepirudin BiotechDrug P00734 PPP2R1B
1 DB00002 Cetuximab BiotechDrug P00533 NaN
2 DB00002 Cetuximab BiotechDrug O75015 PPP2R5B
If you need the original values instead of NaN:
df2['UniProt Name'] = (df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
                       .fillna(df2['UniProt Name']))
print (df2)
DrugBankID Name Type UniProtID \
0 DB00001 Lepirudin BiotechDrug P00734
1 DB00002 Cetuximab BiotechDrug P00533
2 DB00002 Cetuximab BiotechDrug O75015
UniProt Name
0 PPP2R1B
1 Epidermal growth factor receptor
2 PPP2R5B
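For anyone who wants to run the map approach end to end, here is a minimal self-contained sketch built from the sample rows above. The drop_duplicates step is my own assumption, not part of the question: Series.map expects the lookup index to be unique, so if an ID could appear more than once in df1 you have to decide which NAME to keep (here, the first one).

```python
import pandas as pd

# Lookup table: UniProt ID -> short name (sample values from this answer)
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})

# Table whose 'UniProt Name' column should be replaced where a match exists
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "Name": ["Lepirudin", "Cetuximab", "Cetuximab"],
    "Type": ["BiotechDrug"] * 3,
    "UniProtID": ["P00734", "P00533", "O75015"],
    "UniProt Name": [
        "Prothrombin",
        "Epidermal growth factor receptor",
        "Low affinity immunoglobulin gamma Fc region receptor III-B",
    ],
})

# Assumption: if an ID is repeated in df1, keep its first NAME
lookup = df1.drop_duplicates("UniProtID").set_index("UniProtID")["NAME"]

# IDs without a match become NaN, then fall back to the original value
mapped = df2["UniProtID"].map(lookup)
df2["UniProt Name"] = mapped.fillna(df2["UniProt Name"])

print(df2["UniProt Name"].tolist())
```

Keeping the intermediate mapped Series around also makes it easy to count how many IDs had no match, via mapped.isna().sum().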
A solution with merge: do a left join, fill the gaps with fillna or combine_first, and finally remove the extra column with drop:
df = pd.merge(df2, df1, on="UniProtID", how='left')
df['UniProt Name'] = df['NAME'].fillna(df['UniProt Name'])
#alternative
#df['UniProt Name'] = df['NAME'].combine_first(df['UniProt Name'])
df.drop('NAME', axis=1, inplace=True)
print (df)
DrugBankID Name Type UniProtID \
0 DB00001 Lepirudin BiotechDrug P00734
1 DB00002 Cetuximab BiotechDrug P00533
2 DB00002 Cetuximab BiotechDrug O75015
UniProt Name
0 PPP2R1B
1 Epidermal growth factor receptor
2 PPP2R5B
If you only need to replace the column with NAME (unmatched IDs become NaN), merge, then drop the old column and rename NAME:
df = pd.merge(df2, df1, on="UniProtID", how='left')
df = df.drop('UniProt Name', axis=1).rename(columns={'NAME':'UniProt Name'})
print (df)
DrugBankID Name Type UniProtID UniProt Name
0 DB00001 Lepirudin BiotechDrug P00734 PPP2R1B
1 DB00002 Cetuximab BiotechDrug P00533 NaN
2 DB00002 Cetuximab BiotechDrug O75015 PPP2R5B
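With 16169 rows it can be useful to check the join before overwriting anything. The sketch below is only a suggestion and uses two optional pd.merge parameters, validate and indicator. It assumes each UniProtID should appear at most once in df1; the frames are cut down to the columns that matter here.

```python
import pandas as pd

df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "UniProtID": ["P00734", "P00533", "O75015"],
})

# validate="many_to_one" raises MergeError if an ID is repeated in df1;
# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(df2, df1, on="UniProtID", how="left",
                  validate="many_to_one", indicator=True)

# IDs from df2 that have no name in df1
unmatched = merged.loc[merged["_merge"] == "left_only", "UniProtID"].tolist()
print(unmatched)
```

Once the unmatched list looks reasonable, the '_merge' column can be dropped again.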
Answer 1 (score: 0)
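A general way to solve this is to perform an SQL-like join on the two tables.
Note: this may be expensive for larger datasets; I have not tested its performance.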
import pandas as pd
left = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
"Name": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})
right = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
"UniProt Name": ["Prothrombin", "Epidermal growth factor receptor", "Low affinity immunoglobulin gamma Fc region receptor III-B"],
"Type": ["BiotechDrug", "BiotechDrug", "BiotechDrug"],
"DrugBankID": ["DB00001", "DB00002", "DB00003"]})
result = pd.merge(left, right, on="UniProtID")
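One thing to watch: pd.merge defaults to an inner join, so rows whose UniProtID has no partner in the other table are dropped. In the sample above every ID matches, so it does not show. The small sketch below uses illustrative frames (names and drugs are hypothetical variable names, and the rows are cut down from the question's data) to compare the default with how="left".

```python
import pandas as pd

names = pd.DataFrame({"UniProtID": ["Q15173", "P30154"],
                      "Name": ["PPP2R5B", "PPP2R1B"]})
drugs = pd.DataFrame({"UniProtID": ["Q15173", "P00533"],
                      "DrugBankID": ["DB00001", "DB00002"]})

# Default how="inner": only IDs present in both tables survive
inner = pd.merge(drugs, names, on="UniProtID")

# how="left": every row of drugs is kept, unmatched names become NaN
left = pd.merge(drugs, names, on="UniProtID", how="left")

print(len(inner), len(left))
```

If the goal is to keep all 16169 rows and only fill in names where they exist, the left join is probably the one you want.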
Reference: https://pandas.pydata.org/pandas-docs/stable/merging.html#overlapping-value-columns