Question

I have a programming problem that I have not been able to solve. I have a table like this:

GeneA   GeneB   Value  Distance
1       101     0.9  
1       102     1
1       103     0.8
2       201     1
2       202     1
3       301     0.9
3       302     0.8
3       303     0.8
4       401     1

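For each gene in the GeneA column, I want to extract a substitute gene from the GeneB column. Value indicates how similar gene B is to gene A, so I want the GeneB with the highest possible value, as close to 1 as possible.

In some cases, as with gene 2, several candidates have the same value. In that case I also want the candidate with the shortest distance between the two genes.

How should I do this in Python? Thanks!

Edit: my expected output is a table like this: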

GeneA   GeneB   Value   Distance
1       102     1
2       201     1
3       301     0.9
4       401     1

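For GeneB, the choice between 201 and 202 goes to whichever has the shortest distance to GeneA. That distance is computed as the difference between their genomic positions.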

Answer 1

My answer is inspired by this SO question.

In your case:

import pandas as pd

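df = pd.DataFrame({
    'GeneA': ['1', '1', '1', '2', '2', '3', '3', '3', '4'],
    'GeneB': ['101', '102', '103', '201', '202', '301', '302', '303', '401'],
    'Value': [0.9, 1, 0.8, 1, 1, 0.9, 0.8, 0.8, 1],
})

# The sample data has no `Distance` values, so sort by decreasing `Value` only.
# A stable sort keeps tied rows in their original order.
# Once a `Distance` column exists, sort with
#   df.sort_values(['Value', 'Distance'], ascending=[False, True])
# so that ties on `Value` go to the shortest distance.
df = df.sort_values('Value', ascending=False, kind='stable')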

# Group by `GeneA` and select only the first row
df = df.groupby(['GeneA'], sort=False).first()

df

[Out]:
      GeneB  Value
GeneA
1       102    1.0
2       201    1.0
4       401    1.0
3       301    0.9
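
To make the tie-break on distance concrete, here is a minimal sketch of the same idea with a `Distance` column filled in. The distance numbers are invented placeholders (the question does not give any), and `pick_best` is a hypothetical helper name, so treat this as an illustration under those assumptions and not as the definitive solution. It sorts by decreasing `Value`, then by increasing `Distance`, and keeps the first row per `GeneA`.

```python
import pandas as pd

# NOTE: the Distance values are made-up placeholders for illustration only;
# the question does not provide real ones.
df = pd.DataFrame({
    "GeneA": ["1", "1", "1", "2", "2", "3", "3", "3", "4"],
    "GeneB": ["101", "102", "103", "201", "202", "301", "302", "303", "401"],
    "Value": [0.9, 1, 0.8, 1, 1, 0.9, 0.8, 0.8, 1],
    "Distance": [50, 20, 10, 5, 30, 15, 40, 25, 60],
})


def pick_best(frame):
    """For each GeneA keep the row with the highest Value,
    breaking ties by the smallest Distance."""
    # Highest Value first; among equal Values, smallest Distance first.
    ordered = frame.sort_values(["Value", "Distance"], ascending=[False, True])
    # After sorting, the first row seen for each GeneA is the best candidate.
    best = ordered.drop_duplicates("GeneA", keep="first")
    return best.sort_values("GeneA").reset_index(drop=True)


best = pick_best(df)
print(best)
```

With these placeholder distances, gene 2 resolves to 201 because its distance is smaller than that of 202. If 202 had the smaller distance, it would be selected instead.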
