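I have a data frame (df1) containing gene details and the list of organs associated with each gene, and a second mapping data frame (df2) that classifies those organs into distinct organ types.
E.g.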
df1 <-
data.frame("Gene_name"=c("Gene1", "Gene2", "Gene3", "Gene4"),
"Organ_name"=c("Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus", "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine"))
df2 <-
data.frame("Type"=c("External", "External", "External", "External"......"Internal", "Internal", "Internal"...),
"Organ"=c("Skin", "Eyes", "Hair", "Legs",.... "Lungs", "Small intestine", "Oesophagus".....))
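I want to see which major category each individual gene belongs to: does it occur more often in internal or in external organs?
If I split "Organ_name" with str.split(","), I end up with about 20 columns in some cases. Merging each of those separate "Organ_name" columns in df1 with "Type" in df2, using Organ as the key, is a real pain.
Is there a better way to analyze this data? How can I get the frequency/count of each organ "Type"? Please let me know.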
Answer 0 (score: 1)
Here is an example of how to build the logic using pandas.
<strong>Setup</strong>
import pandas as pd
df1 = pd.DataFrame({"Gene_name": ("Gene1", "Gene2", "Gene3", "Gene4"),
"Organ_name": ("Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus",
"Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine")})
df2 = pd.DataFrame({"Type": ("External", "External", "External", "External", "Internal", "Internal", "Internal"),
"Organ": ("Skin", "Eyes", "Hair", "Legs", "Lungs", "Small intestine", "Oesophagus")})
<strong>Solution</strong>
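# Series mapping each organ name to its type
t = df2.set_index('Organ')['Type']
# Split the comma-separated organ string into a list of organs
df1['Organ_list'] = df1['Organ_name'].str.split(', ')
# Look up each organ's type; t.get returns None for unmapped organs, which filter drops
df1['Int_Ext'] = [list(filter(None, map(t.get, x))) for x in df1['Organ_list']]
# 'Internal' if at least half of the mapped organs are internal, else 'External'
# (raises ZeroDivisionError if none of a gene's organs is mapped)
df1['Int_Ext_Flag'] = df1['Int_Ext'].apply(
    lambda x: 'Internal' if x.count('Internal') / len(x) >= 0.5 else 'External')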
<strong>Result</strong>
  Gene_name                    Organ_name                      Organ_list  \
0     Gene1     Skin, Stomach, Eyes, Hair     [Skin, Stomach, Eyes, Hair]
1     Gene2      Lungs, Mouth, Oesophagus      [Lungs, Mouth, Oesophagus]
2     Gene3  Pharynx, Lungs, Throat, Skin  [Pharynx, Lungs, Throat, Skin]
3     Gene4      Stomach, Small intestine      [Stomach, Small intestine]

                          Int_Ext Int_Ext_Flag
0  [External, External, External]     External
1            [Internal, Internal]     Internal
2            [Internal, External]     Internal
3                      [Internal]     Internal
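<strong>Explanation</strong>
Use df2 to create a mapping from organ to type.
Split the strings in df1['Organ_name'] to form the lists stored in df1['Organ_list'].
Use pd.Series.apply to determine 'Internal' or 'External' for each gene.
list(filter(None, ...)) filters out organs that have not been mapped to a type.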
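If you also want the actual frequency of each type per gene, rather than only a majority flag, one option is to reshape the data to long form. The following is only a sketch and not part of the approach above: it assumes each organ appears once in df2, and the names long_df, type_counts and the "Unmapped" label are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Gene_name": ["Gene1", "Gene2", "Gene3", "Gene4"],
    "Organ_name": ["Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus",
                   "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine"],
})
df2 = pd.DataFrame({
    "Type": ["External", "External", "External", "External",
             "Internal", "Internal", "Internal"],
    "Organ": ["Skin", "Eyes", "Hair", "Legs",
              "Lungs", "Small intestine", "Oesophagus"],
})

# One row per (gene, organ) pair instead of one column per organ.
long_df = df1.assign(Organ=df1["Organ_name"].str.split(",")).explode("Organ")
long_df["Organ"] = long_df["Organ"].str.strip()  # drop the space after each comma

# Left-join the organ-to-type mapping (assumes each organ appears once in df2).
long_df = long_df.merge(df2, on="Organ", how="left")

# "Unmapped" is a hypothetical label for organs that are missing from df2.
long_df["Type"] = long_df["Type"].fillna("Unmapped")

# Count how often each type occurs for each gene.
type_counts = pd.crosstab(long_df["Gene_name"], long_df["Type"])
print(type_counts)
```

Keeping the unmapped organs as their own column makes gaps in df2 visible instead of silently dropping them, and a gene with no mapped organs simply gets zero counts rather than causing a division by zero.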