Select rows from a pandas DataFrame based on ranges defined in another DataFrame

Time: 2018-04-23 08:03:10

Tags: python pandas dataframe

I have two DataFrames.

First DataFrame: df_json

+------------+-----------------+-----------+------------+
| chromosome |   ensembl_id    | gene_end  | gene_start |
+------------+-----------------+-----------+------------+
|          7 | ENSG00000122543 |   5886362 |    5879827 |
|         12 | ENSG00000111325 | 122980043 |  122974580 |
|         17 | ENSG00000181396 |  82418637 |   82389223 |
|          6 | ENSG00000119900 |  71308950 |   71288803 |
|          9 | ENSG00000106809 |  92404696 |   92383967 |
+------------+-----------------+-----------+------------+

Second DataFrame: df

+------------+-----------------+-----------+------------+
| rs_id      |   variant       | gene_id   | chromosome |
+------------+-----------------+-----------+------------+
| rs13184706 | 5:43888254:C:T  |   43888254|      5     |
| rs58824264 | 5:43888493:C:T  |   43888493|      5     |
+------------+-----------------+-----------+------------+

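I want to iterate over df_json and, for each of its rows, select the rows from df where gene_id falls within the range (gene_start, gene_end) and df['chromosome'] == df_json['chromosome']. I also need to create a new column in the resulting DataFrame that holds the ensembl_id from df_json.

The code below does this, but it is very slow. I need a faster approach, because I have to run this on millions of rows.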

result_df = []
for row in df_json.itertuples():
  gene_end, gene_start = row.gene_end, row.gene_start
  # between() includes both endpoints by default
  gene = df.loc[df['gene_id'].between(gene_start, gene_end) & (df['chromosome'] == row.chromosome)].copy()
  gene['ensembl_id'] = row.ensembl_id
  result_df.append(gene)
print(result_df[0])

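2 answers:

Answer 0 (score: 0)

You should avoid iterating over the rows of a pandas DataFrame wherever possible, since it is inefficient and harder to read.

You can implement the logic with pd.DataFrame.merge and pd.Series.between. I have changed the data in the example to make it more interesting.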

import pandas as pd

df_json = pd.DataFrame({'chromosome': [7, 12, 17, 6, 9],
                        'ensembl_id': ['ENSG00000122543', 'ENSG00000111325', 'ENSG00000181396',
                                       'ENSG00000119900', 'ENSG00000106809'],
                        'gene_end': [5886362, 122980043, 82418637, 71308950, 92404696],
                        'gene_start': [5879827, 122974580, 82389223, 71288803, 92383967]})

df = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
                   'variant': ['5:43888254:C:T', '5:43888493:C:T'],
                   'gene_id': [5880000, 43888493],
                   'chromosome': [7, 9]})

res = df_json.merge(df, how='left', on='chromosome')
res = res[res['gene_id'].between(res['gene_start'], res['gene_end'])]

print(res)

#    chromosome       ensembl_id  gene_end  gene_start    gene_id       rs_id  \
# 0           7  ENSG00000122543   5886362     5879827  5880000.0  rs13184706   

#           variant  
# 0  5:43888254:C:T  
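
The question asks for the rows of df with an extra ensembl_id column, so here is a minimal sketch of the same merge-then-filter idea shaped that way. The function name annotate_variants is made up for illustration, and the switch to an inner merge is my assumption, not part of the answer above: it avoids NaN rows, so gene_id keeps its integer dtype.

```python
import pandas as pd


def annotate_variants(df, df_json):
    """Return rows of df whose gene_id lies inside a gene range on the same chromosome."""
    # Inner merge: only chromosomes present in both frames survive, so no NaN
    # rows are created and gene_id stays an integer column.
    merged = df.merge(df_json, on='chromosome', how='inner')
    # between() is inclusive on both ends by default.
    in_range = merged['gene_id'].between(merged['gene_start'], merged['gene_end'])
    # Keep the original df columns plus the matching ensembl_id.
    return merged.loc[in_range, list(df.columns) + ['ensembl_id']].reset_index(drop=True)


# Same altered sample data as above, reduced to the two chromosomes that occur in df.
df_json = pd.DataFrame({'chromosome': [7, 9],
                        'ensembl_id': ['ENSG00000122543', 'ENSG00000106809'],
                        'gene_end': [5886362, 92404696],
                        'gene_start': [5879827, 92383967]})
df = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
                   'variant': ['5:43888254:C:T', '5:43888493:C:T'],
                   'gene_id': [5880000, 43888493],
                   'chromosome': [7, 9]})

result = annotate_variants(df, df_json)
print(result)
```

Keep in mind that merging on chromosome alone pairs every variant with every gene on that chromosome before the filter runs, so the intermediate frame can get large on millions of rows.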

Answer 1 (score: 0)

Use pyranges for large datasets. It is very efficient and fast:

import pyranges as pr

c = """Chromosome    ensembl_id     End   Start
7  ENSG00000122543    5886362     5879827
12  ENSG00000111325  122980043   122974580
17  ENSG00000181396   82418637    82389223
5  MadeUp 43889000    43888253
6  ENSG00000119900   71308950    71288803
9  ENSG00000106809   92404696    92383967"""

c2 = """rs_id         variant        Start End    Chromosome
rs13184706  5:43888254:C:T     43888254 43888256      5
rs58824264  5:43888493:C:T     43888493 43888494      5"""

gr = pr.from_string(c)
gr2 = pr.from_string(c2)

j = gr.join(gr2)
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# |   Chromosome | ensembl_id   |       End |     Start | rs_id      | variant        |   Start_b |     End_b |
# |   (category) | (object)     |   (int32) |   (int32) | (object)   | (object)       |   (int32) |   (int32) |
# |--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------|
# |            5 | MadeUp       |  43889000 |  43888253 | rs13184706 | 5:43888254:C:T |  43888254 |  43888256 |
# |            5 | MadeUp       |  43889000 |  43888253 | rs58824264 | 5:43888493:C:T |  43888493 |  43888494 |
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# Unstranded PyRanges object has 2 rows and 8 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.

df = j.df # as pandas df
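
If adding pyranges as a dependency is not an option, a pandas-only range join can be sketched with pd.merge_asof. This is a different technique from the one in this answer, and it rests on an assumption: gene ranges on the same chromosome must not overlap, because merge_asof returns at most one gene per variant. The function name assign_genes_asof and the two "hypothetical" variant rows are made up for illustration, as is the MadeUp gene borrowed from the example above.

```python
import pandas as pd


def assign_genes_asof(variants, genes):
    """Range join via merge_asof; assumes gene ranges on a chromosome do not overlap."""
    # merge_asof requires both frames to be sorted by their join keys.
    variants = variants.sort_values('gene_id')
    genes = genes.sort_values('gene_start')
    # For each variant, take the last gene on the same chromosome that starts
    # at or before the variant position.
    merged = pd.merge_asof(variants, genes, left_on='gene_id', right_on='gene_start',
                           by='chromosome', direction='backward')
    # Keep the match only if the position is also at or before the gene's end.
    # Unmatched rows have NaN in gene_end, so the comparison is False for them.
    inside = merged['gene_id'] <= merged['gene_end']
    return merged.loc[inside, list(variants.columns) + ['ensembl_id']].reset_index(drop=True)


genes = pd.DataFrame({'chromosome': [7, 5, 9],
                      'ensembl_id': ['ENSG00000122543', 'MadeUp', 'ENSG00000106809'],
                      'gene_start': [5879827, 43888253, 92383967],
                      'gene_end': [5886362, 43889000, 92404696]})

# The last two rows are hypothetical variants that fall outside every gene.
variants = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264', 'hypothetical_1', 'hypothetical_2'],
                         'variant': ['5:43888254:C:T', '5:43888493:C:T', '7:1000:A:G', '7:5900000:A:G'],
                         'gene_id': [43888254, 43888493, 1000, 5900000],
                         'chromosome': [5, 5, 7, 7]})

matched = assign_genes_asof(variants, genes)
print(matched)
```

When genes can overlap, this sketch would silently drop matches, and an interval-aware tool such as pyranges is the safer choice.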