我有一个包含5列的数据框,并想根据其值(大于0)将其中的2列(Chemo和Surgery)转换为行(诊断系列),并添加诸如年龄和诊断的信息等信息行。
这是我的数据框
import pandas as pd
data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
print df
我试图获取Chemo / Surgery大于0的值,但是当我尝试将其作为一行添加时,它不起作用。
这就是我想要的最终结果。
ID Diagnosis Age at Diagnosis
0 A-1 Birth 0
1 A-1 Lung cancer 25
2 A-1 Chemo 25
3 A-1 Surgery 25
4 A-1 Death 50
5 A-2 Birth 0
6 A-2 Brain cancer 12
7 A-2 Chemo 12
8 A-2 Skin cancer 20
9 A-2 Chemo 20
10 A-2 Surgery 20
11 A-2 Current age 23
12 A-3 Birth 0
13 A-3 Brain cancer 30
14 A-3 Surgery 30
15 A-3 Lung cancer 33
16 A-3 Chemo 33
17 A-3 Current age 35
这是我尝试过的事情之一:
chem = "Chemo"
try_df = (df[chem] > 1)
nd = df[try_df]
df["Diagnosis"] = df[chem]
print df
答案 0 :(得分:3)
我们可以融化两列Chemo
和Surgery
,然后放回所有零和concat
:
# melt the two columns
new_df = df[['ID', 'Chemo', 'Surgery']].melt(id_vars='ID',
value_name='Age at Diagnosis',
var_name='Diagnosis')
# filter out the zeros
new_df = new_df[new_df['Age at Diagnosis'].ne('0')]
# concat with the original dataframe, ignoring the extra columns
new_df = pd.concat((df,new_df), sort=False, join='inner')
# sort values
new_df.sort_values(['ID','Age at Diagnosis'])
输出:
ID Diagnosis Age at Diagnosis
0 A-1 Birth 0
1 A-1 Lung cancer 25
1 A-1 Chemo 25
12 A-1 Surgery 25
2 A-1 Death 50
3 A-2 Birth 0
4 A-2 Brain cancer 12
4 A-2 Chemo 12
5 A-2 Skin cancer 20
5 A-2 Chemo 20
16 A-2 Surgery 20
6 A-2 Current age 23
7 A-3 Birth 0
8 A-3 Brain cancer 30
19 A-3 Surgery 30
9 A-3 Lung cancer 33
9 A-3 Chemo 33
10 A-3 Current age 35
答案 1 :(得分:1)
此尝试非常冗长,需要执行一些步骤。我们无法进行简单的数据透视或索引/列堆叠,因为我们需要用另一列的部分结果来修改一列。这需要拆分和追加。
首先,将您的数据框转换为我们可以使用的dtypes。
data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
df[["Age at Diagnosis", "Chemo", "Surgery"]] = df[["Age at Diagnosis", "Chemo", "Surgery"]].astype(int)
现在,我们将东西分解成碎片。
# I like making a copy or resetting an index so that
# pandas is not operating off a slice
df_chemo = df[df.Chemo > 0].copy()
df_surgery = df[df.Surgery > 0].copy()
# drop columns you don't need
df_chemo.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df_surgery.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df.drop(["Chemo", "Surgery"], axis=1, inplace=True)
# Set Chemo and Surgery Diagnosis
df_chemo.Diagnosis = "Chemo"
df_surgery.Diagnosis = "Surgery"
然后将所有内容附加在一起。您可以这样做,因为列尺寸匹配。
df_new = df.append(df_chemo).append(df_surgery)
# make it look pretty
df_new.sort_values(["ID", "Age at Diagnosis"]).reset_index(drop=True)