My dataframe looks like this:
plant  ancestor1  ancestor2  ancestor3  ancestor4  ancestor5
XX     XX1        XX2        XX3        XX4        XX5
YY     YY1        YY2        YY3        YY4
ZY     ZZ1        ZZ2        YY2        YY3        YY4
SS1    SS2        SS3
For each plant, I want to get the oldest ancestor. The final output should look like this:
plant oldest
XX XX5
XX1 XX5
XX2 XX5
XX3 XX5
XX4 XX5
YY YY4
YY1 YY4
YY2 YY4
YY3 YY4
ZY YY4
ZZ1 YY4
ZZ2 YY4
SS1 SS3
SS2 SS3
How can I achieve this?
Answer 0 (score: 2)
df2 = df.ffill(axis=1).melt(id_vars='ancestor5', value_name='plant')
df2 = df2.rename(columns={'ancestor5': 'oldest'}).drop(columns='variable')
df2 = df2[df2['oldest'] != df2['plant']]
print(df2)
oldest plant
0 XX5 XX
1 YY4 YY
2 YY4 ZY
3 SS3 SS1
4 XX5 XX1
5 YY4 YY1
6 YY4 ZZ1
7 SS3 SS2
8 XX5 XX2
9 YY4 YY2
10 YY4 ZZ2
12 XX5 XX3
13 YY4 YY3
14 YY4 YY2
16 XX5 XX4
18 YY4 YY3
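Explanation: use melt to convert the dataframe to long format, but before doing so, use ffill along the rows so that one column (ancestor5) always holds the oldest ancestor. Afterwards, drop the rows whose values were only duplicated by the forward fill.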
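Answer 1 (score: 1)
Here is a fast approach using numpy's isin, repeat, and concatenate together with list comprehensions. It also allows empty ancestor positions to be an empty string, None, or any other placeholder.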
import numpy as np
import pandas as pd

df_vals = df.values
# count the number of sub-ancestors in each row (non-empty cells minus the oldest one)
repeats = (~np.isin(df_vals, ['', None])).sum(axis=1) - 1
# find the oldest ancestor in each row (the last non-empty cell)
oldest_ancestors = np.array([df_vals[row, col] for row, col in enumerate(repeats)])
# make the oldest column by repeating each oldest ancestor once per sub-ancestor
oldest = np.repeat(oldest_ancestors, repeats)
# make the plant column by getting all the sub-ancestors from each row and concatenating
plant = np.concatenate([df_vals[row][:col] for row, col in enumerate(repeats)])
df2 = pd.DataFrame({'plant': plant, 'oldest': oldest})
print(df2)
plant oldest
0 XX XX5
1 XX1 XX5
2 XX2 XX5
3 XX3 XX5
4 XX4 XX5
5 YY YY4
6 YY1 YY4
7 YY2 YY4
8 YY3 YY4
9 ZY YY4
10 ZZ1 YY4
11 ZZ2 YY4
12 YY2 YY4
13 YY3 YY4
14 SS1 SS3
15 SS2 SS3
Setting up the dataframe:
df = pd.DataFrame({'plant': ['XX', 'YY', 'ZY', 'SS1'],
                   'ancestor1': ['XX1', 'YY1', 'ZZ1', 'SS2'],
                   'ancestor2': ['XX2', 'YY2', 'ZZ2', 'SS3'],
                   'ancestor3': ['XX3', 'YY3', 'YY2', None],
                   'ancestor4': ['XX4', 'YY4', 'YY3', None],
                   'ancestor5': ['XX5', None, 'YY4', None]})
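To illustrate how the placeholder handling in this answer works, the sketch below runs only the counting step on a small made-up array (the values are illustrative, not taken from the question's dataframe). Each row's count of real names, minus one, is both the number of output rows for that row and the column index of its oldest ancestor.

```python
import numpy as np

# Illustrative object array mixing both placeholder styles ('' and None).
vals = np.array([
    ['XX',  'XX1', 'XX2'],
    ['YY',  'YY1', ''],
    ['SS1', None,  None],
], dtype=object)

# True where a cell holds a real name, False where it holds a placeholder.
has_name = ~np.isin(vals, ['', None])

# Names per row minus one: the number of sub-ancestors, which is also the
# column index of the last real name (the oldest ancestor).
repeats = has_name.sum(axis=1) - 1
print(repeats)

# The oldest ancestor of each row, picked by that column index.
oldest = [vals[row, col] for row, col in enumerate(repeats)]
print(oldest)
```

A row that contains only the plant itself gets a repeat count of 0, so it contributes no rows to the final result.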
Answer 2 (score: 0)
Perhaps something like this:
df = pd.DataFrame({'plant': ['x', 'y', 'z'],
                   'ancestor1': ['X1', 'Y1', 'Z2'],
                   'ancestor2': ['X2', '', 'Z2'],
                   'ancestor3': ['X3', '', '']})
# for each row, keep the non-empty strings and take the last one
df['oldest'] = [list(filter(len, list(df.iloc[i])))[-1] for i in range(len(df))]
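Note that filter(len, ...) only works when every empty position is an empty string; a None cell would raise a TypeError. A possible variant that tolerates both kinds of placeholder is sketched below. The helper name last_non_empty and the sample values are assumptions made for illustration, and like the code above it adds one oldest column per row rather than producing the long plant/oldest table from the question.

```python
import pandas as pd

# Illustrative data mixing '' and None as placeholders.
df = pd.DataFrame({'plant': ['x', 'y', 'z'],
                   'ancestor1': ['X1', 'Y1', 'Z1'],
                   'ancestor2': ['X2', '', 'Z2'],
                   'ancestor3': ['X3', None, '']})

def last_non_empty(row):
    """Return the last cell in the row that is neither missing nor ''."""
    kept = [v for v in row if pd.notna(v) and v != '']
    return kept[-1]

# Apply the helper row by row (axis=1) to build the new column.
df['oldest'] = df.apply(last_non_empty, axis=1)
print(df)
```

If a row has no ancestors at all, this returns the plant itself, which may or may not be what you want.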
Answer 3 (score: 0)
Here is another approach using a list comprehension (perhaps a little ugly).
dfout = pd.DataFrame([
(y, x[-1]) for x in [[i for i in ii if i] for ii in df.values]
for y in x[:-1]
], columns = ['plant', 'oldest']
)
Full example:
import pandas as pd
df = pd.DataFrame({
'plant': ['XX','YY','ZY'],
'ancestor1': ['XX1','YY1','ZZ1'],
'ancestor2': ['XX2','YY2',''],
'ancestor3': ['XX3','','']
})
df = df[['plant','ancestor1','ancestor2','ancestor3']]
dfout = pd.DataFrame([
(y, x[-1]) for x in [[i for i in ii if i] for ii in df.values]
for y in x[:-1]
], columns = ['plant', 'oldest']
)
print(dfout)
Returns:
plant oldest
0 XX XX3
1 XX1 XX3
2 XX2 XX3
3 YY YY2
4 YY1 YY2
5 ZY ZZ1
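If the nested comprehension is hard to read, roughly the same logic can be written as explicit loops. This is only a sketch: the function name oldest_ancestor_pairs is made up for illustration, and the pd.notna check is an added assumption so that None placeholders are skipped as well as empty strings.

```python
import pandas as pd

def oldest_ancestor_pairs(df):
    """Build (plant, oldest) pairs from a wide plant/ancestor table."""
    pairs = []
    for row in df.to_numpy():
        # Keep only real names, in column order.
        lineage = [v for v in row if pd.notna(v) and v != '']
        if len(lineage) < 2:
            continue  # no ancestor recorded for this plant
        oldest = lineage[-1]
        # Every earlier name in the row maps to the last one.
        for name in lineage[:-1]:
            pairs.append((name, oldest))
    return pd.DataFrame(pairs, columns=['plant', 'oldest'])

df = pd.DataFrame({
    'plant': ['XX', 'YY', 'ZY'],
    'ancestor1': ['XX1', 'YY1', 'ZZ1'],
    'ancestor2': ['XX2', 'YY2', ''],
    'ancestor3': ['XX3', '', ''],
})
dfout = oldest_ancestor_pairs(df)
print(dfout)
```

On the example data this produces the same six rows as the comprehension above.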