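I have a MultiIndex DataFrame and a dictionary. Some of the dictionary's keys match some of the values in the first index level. I want to add a new column that holds the dictionary value for each row's query_name.
This is my DataFrame: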
S_genus
query_name
GCA_000237975.1 g__Sulfobacillus_A 118.0
GCA_000307585.2 g__Thermoanaerobacterium 118.0
g__Thermoanaerobacter 1.0
g__Ruminiclostridium_F 1.0
GCA_000404785.1 g__Cloacimonetes-1 1.0
... ...
GCF_900141705.1 g__Fibrobacter 116.0
GCF_900142435.1 g__Thermocrinis_A 113.0
GCF_900175965.1 g__Rubrobacter 116.0
GCF_900176285.1 g__Desulfacinum 118.0
GCF_900215515.1 g__Persephonella 118.0
And this is my dictionary:
acc2genus
'GCF_001658645.1': 'g__Staphylococcus',
'GCF_900117665.1': 'g__Acinetobacter',
'GCF_000652055.1': 'g__Mycobacterium',
'GCF_003037025.1': 'g__Klebsiella',
'GCF_002138225.1': 'g__Acinetobacter',
'GCF_001186785.1': 'g__Vibrio',
'GCF_001671475.1': 'g__Mesorhizobium',
'GCF_000153745.1': 'g__Amylibacter_A',
'GCF_002814015.1': 'g__Klebsiella',
I have tried something like this:
rdf["S_genus", "nueva"] = rdf["S_genus"].apply(lambda x: acc2genus[x])
I have tried many variations, but I either get an error or lose the third column (the numbers).
Can anyone help me?
Answer 0 (score: 1)
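You can use to_frame to convert the MultiIndex into a DataFrame, select the first level by its label (query_name), and then translate each value through the dictionary with a list comprehension: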
import pandas as pd

# example data frame, simplified
rdf = pd.DataFrame({'S_genus': [118.0, 118.0, 1.0, 1.0]},
                   index=pd.MultiIndex.from_tuples(
                       [('GCA_000237975.1', 'g__Sulfobacillus_A'),
                        ('GCA_000307585.2', 'g__Thermoanaerobacterium'),
                        ('GCA_000307585.2', 'g__Thermoanaerobacter'),
                        ('GCA_000307585.2', 'g__Ruminiclostridium_F')]))
rdf.index.names = ['query_name', '']

# example dictionary, simplified
acc2genus = {'GCA_000237975.1': 'Sulfo',
             'GCA_000307585.2': 'Thermo'}

# new column: values from first index level translated via dictionary
# (the index is converted once, outside the comprehension)
query_names = rdf.index.to_frame()['query_name'].values
rdf['nueva'] = [acc2genus[name] for name in query_names]
rdf
                                          S_genus   nueva
query_name
GCA_000237975.1 g__Sulfobacillus_A          118.0   Sulfo
GCA_000307585.2 g__Thermoanaerobacterium    118.0  Thermo
                g__Thermoanaerobacter         1.0  Thermo
                g__Ruminiclostridium_F        1.0  Thermo
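
Note that the list comprehension above looks every key up with acc2genus[...], so it raises a KeyError as soon as a query_name is missing from the dictionary. The question says only some keys match, so that is likely to happen. A possible alternative, sketched below, reads the first level with get_level_values and translates it with map, which leaves NaN wherever no key is found. The second level name "genus" and the extra fifth row are assumptions added for illustration; they are not part of the original example.

```python
import pandas as pd

# Simplified frame with the same two-level index as in the question.
# The level name "genus" is a hypothetical label chosen for this sketch.
rdf = pd.DataFrame(
    {"S_genus": [118.0, 118.0, 1.0, 1.0, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("GCA_000237975.1", "g__Sulfobacillus_A"),
            ("GCA_000307585.2", "g__Thermoanaerobacterium"),
            ("GCA_000307585.2", "g__Thermoanaerobacter"),
            ("GCA_000307585.2", "g__Ruminiclostridium_F"),
            # Extra row whose accession is deliberately absent from the dictionary.
            ("GCA_000404785.1", "g__Cloacimonetes-1"),
        ],
        names=["query_name", "genus"],
    ),
)

# Simplified dictionary; it does not cover every query_name.
acc2genus = {"GCA_000237975.1": "Sulfo", "GCA_000307585.2": "Thermo"}

# Take the first index level for every row and translate it through the dictionary.
# Accessions without a dictionary entry become NaN instead of raising KeyError.
rdf["nueva"] = rdf.index.get_level_values("query_name").map(acc2genus)

print(rdf)
```

The existing S_genus column is untouched, and rows without a match can be filled afterwards (for example with fillna) if a placeholder is preferred over NaN.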