Question

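I have been struggling to write some code that annotates the rows of one pandas dataframe with elements from another pandas dataframe, based on certain conditions. The first dataframe is just a table containing chromosome numbers and genomic positions: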

   chr1      s1
0     1  645710
1     1  668630
2     1  713044
3     1  738570
4     1  766600

The second dataframe contains annotations, defined by the genomic region they span and the chromosome they occur on (s = start and e = end):

    chr      s      e                 state
0  chr1  10000  10600          repetive/CNV
1  chr1  10600  11137       heterochromatin
2  chr1  11137  11737             insulator
3  chr1  11737  11937      weak_transcribed
4  chr1  11937  12137  poised/weak_enhancer

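Now I want to add another column to the first dataframe indicating which state each position belongs to. I have been trying to iterate over the first dataframe using conditions like the following (the chromosomes must match, and the position in df1 must lie between the start and end positions in df2):

"chr" + str(df1["chr1"]) == df2["chr"]
df1["s1"] <= df2["e"] and df1["s1"] >= df2["s"]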

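My general approach is to loop over each row of df1 and, for each row, iterate over all of df2, check the conditions, and, if they are met, append the state from df2 to a new column in df1. So far I have had no success. As a novice Python programmer: how should I approach this?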

Answer 1

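If the dataframes are not too long, one solution is to do a left join and then filter. First make sure the chromosome labels match in both dataframes (for example, replace 1 with chr1 in the first dataframe), then: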

df = pd.merge(df1, df2, left_on="chr1", right_on="chr", how="left")
df = df[(df["s1"] <= df["e"]) & (df["s1"] >= df["s"])]
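For illustration, here is a minimal end-to-end sketch of the merge-then-filter idea. The positions in df1 are made up so that some of them fall inside the sample regions (the positions shown in the question lie outside the five regions listed there), and the final re-merge that keeps unmatched positions as NaN is an addition that is not part of the answer above. Note that the merge pairs every position with every region on the same chromosome, so the intermediate table can get large.

```python
import pandas as pd

# Illustrative positions (assumed, not the question's data)
df1 = pd.DataFrame({"chr1": [1, 1, 1], "s1": [10050, 11140, 713044]})
df2 = pd.DataFrame({
    "chr": ["chr1", "chr1", "chr1"],
    "s": [10000, 10600, 11137],
    "e": [10600, 11137, 11737],
    "state": ["repetive/CNV", "heterochromatin", "insulator"],
})

# Make the chromosome labels match: 1 -> "chr1"
df1["chr1"] = "chr" + df1["chr1"].astype(str)

# Pair every position with every region on the same chromosome,
# then keep only the pairs where the position falls inside the region.
merged = pd.merge(df1, df2, left_on="chr1", right_on="chr", how="left")
hits = merged[(merged["s1"] >= merged["s"]) & (merged["s1"] <= merged["e"])]

# Re-attach the states to df1 so positions without a match are kept (state = NaN).
annotated = df1.merge(hits[["chr1", "s1", "state"]], on=["chr1", "s1"], how="left")
print(annotated)
```

Because both bounds are inclusive here, a position that sits exactly on a shared boundary (for example 10600) would match two adjacent regions; using a strict `<` on the end coordinate avoids that if the regions are half-open.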

Answer 2

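If your second dataframe with the genomic regions is already sorted, you can do a binary search, which reduces the cost of each lookup from O(n) to O(log(n)). For large datasets and many lookups, this can be a big improvement.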

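If it is not sorted, then depending on how many lookups you need to do, I would consider using a search tree, which essentially builds a structure for performing binary search. However, if you only need a handful of lookups, building a (balanced) search tree first may cost more overhead than it saves.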
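A minimal sketch of the binary-search lookup, using only the standard-library bisect module. It assumes the regions belong to a single chromosome, are sorted by start, do not overlap, and are half-open (start included, end excluded); the helper name lookup_state is hypothetical.

```python
from bisect import bisect_right

# Regions for one chromosome, sorted by start (taken from the question's second table)
starts = [10000, 10600, 11137, 11737, 11937]
ends = [10600, 11137, 11737, 11937, 12137]
states = ["repetive/CNV", "heterochromatin", "insulator",
          "weak_transcribed", "poised/weak_enhancer"]

def lookup_state(pos):
    """Return the state of the region containing pos, or None if there is none."""
    # Index of the last region whose start is <= pos
    i = bisect_right(starts, pos) - 1
    # Assumption: regions are half-open, so pos must be strictly below the end
    if i >= 0 and pos < ends[i]:
        return states[i]
    return None

for pos in (10050, 10600, 713044):
    print(pos, lookup_state(pos))
```

With several chromosomes, one such set of sorted lists per chromosome (for example in a dict keyed by chromosome name) keeps each lookup logarithmic in the number of regions on that chromosome.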

Answer 3

Doing this with pandas alone is hard, error-prone, and slow to execute. Use pyranges instead, and convert the result back to a dataframe afterwards.

import pyranges as pr

c1 = """Chromosome Start End
chr1   10050 10051
chr1   713044 713045
chr1   11140  11141"""

c2 = """Chromosome Start End state
chr1    10000   10600   repetive/CNV
chr1    10600   11137   heterochromatin
chr1    11137   11737   insulator"""

gr1, gr2 = pr.from_string(c1), pr.from_string(c2)

j = gr1.join(gr2).drop(like="_b")
# +--------------+-----------+-----------+--------------+
# | Chromosome   |     Start |       End | state        |
# | (category)   |   (int32) |   (int32) | (object)     |
# |--------------+-----------+-----------+--------------|
# | chr1         |     10050 |     10051 | repetive/CNV |
# | chr1         |     11140 |     11141 | insulator    |
# +--------------+-----------+-----------+--------------+
# Unstranded PyRanges object has 2 rows and 4 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.

df = j.df
#   Chromosome  Start    End         state
# 0       chr1  10050  10051  repetive/CNV
# 1       chr1  11140  11141     insulator

Annotating one dataframe with elements from another dataframe in Python
