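I want to identify rows whose start and stop positions overlap with the start and stop positions of other rows. Some restrictions apply:
Here is a minimal representation of the dataset: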
id type start stop
0 1 AP 0 10
1 2 AP 3 7
2 3 ES 5 15
3 4 ES 12 18
Here is a picture that better describes the problem. Each box represents an event/row, and the numbers are their IDs.
This is my desired output:
id type start stop number_of_overlapping_exons
0 1 AP 0 10 2
1 2 AP 3 7 2
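I want to find the rows where type equals AP and that are overlapped by other rows (of any type). In the picture above, the blue boxes are the AP events. Two events/rows overlap blue box 1 (boxes 2 and 3), so number_of_overlapping_exons for ID 1 should be 2. Blue box 2 also has two overlapping events (boxes 1 and 3). This is what I have so far: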
import pandas as pd
# Sample input
df = pd.DataFrame({
"id": [1, 2, 3, 4],
"type": ["AP", "AP", "ES", "ES"],
"start": [0, 3, 5, 12],
"stop": [10, 7, 15, 18]
})
# Extract only AP events
ap = df.loc[df.type == "AP"]
# Find events that overlap start positions in "ap"
# by identifying "start" or "stop" positions in "df"
# that are greater or equal to "start" positions in "ap".
overlapping_start_positions = df.loc[(df.start >= ap.start) | (df.stop >= ap.start)]
# Find events that overlap stop positions in "ap"
# by identifying "start" or "stop" positions in "df"
# that are smaller or equal to "stop" positions in "ap".
overlapping_stop_positions = df.loc[(df.start <= ap.stop) | (df.stop <= ap.stop)]
I get a ValueError on the line that assigns overlapping_start_positions:
ValueError: Can only compare identically-labeled Series objects
Edit
On second thought, condition 3 is not really needed. Every event overlaps with itself, so I can simply subtract 1 from number_of_overlapping_exons.
Answer 0: (score: 1)
I think there is a clever way to do this in a single pass, but a brute-force solution is to loop over the rows of the dataframe.
For example:
import pandas as pd
# Sample input
df = pd.DataFrame({
"id": [1, 2, 3, 4],
"type": ["AP", "AP", "ES", "ES"],
"start": [0, 3, 5, 12],
"stop": [10, 7, 15, 18]
})
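# Count, for every row, how many other rows overlap it
df['count'] = 0
for row in df.itertuples():
    # Two intervals overlap when each one starts before the other one ends
    mask = (row.start <= df.stop) & (row.stop >= df.start)
    # Subtract 1 because every row overlaps itself
    df.loc[row.Index, 'count'] = mask.sum() - 1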
This gives:
id start stop type count
0 1 0 10 AP 2
1 2 3 7 AP 2
2 3 5 15 ES 3
3 4 12 18 ES 1
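The same overlap test can probably be written without the explicit loop. The following is only a sketch, assuming NumPy broadcasting is acceptable and the dataframe is small enough that an n-by-n boolean matrix fits in memory. The names `overlaps` and `ap_overlaps` are made up for illustration. It also filters down to the AP rows so the result matches the output requested in the question.

```python
import numpy as np
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18],
})

start = df["start"].to_numpy()
stop = df["stop"].to_numpy()

# overlaps[i, j] is True when row i and row j overlap:
# row i starts no later than row j stops, and stops no earlier than row j starts.
overlaps = (start[:, None] <= stop[None, :]) & (stop[:, None] >= start[None, :])

# Every row overlaps itself, so subtract 1 from each row total.
df["number_of_overlapping_exons"] = overlaps.sum(axis=1) - 1

# Keep only AP rows that are overlapped by at least one other row.
ap_overlaps = df.loc[(df["type"] == "AP") & (df["number_of_overlapping_exons"] > 0)]
print(ap_overlaps)
```

Note that the matrix grows quadratically with the number of rows, so for large datasets the row loop above (or an interval-tree approach) may be the safer choice.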