I have two DataFrames.
First DataFrame: df_json
+------------+-----------------+-----------+------------+
| chromosome | ensembl_id | gene_end | gene_start |
+------------+-----------------+-----------+------------+
| 7 | ENSG00000122543 | 5886362 | 5879827 |
| 12 | ENSG00000111325 | 122980043 | 122974580 |
| 17 | ENSG00000181396 | 82418637 | 82389223 |
| 6 | ENSG00000119900 | 71308950 | 71288803 |
| 9 | ENSG00000106809 | 92404696 | 92383967 |
+------------+-----------------+-----------+------------+
Second DataFrame: df
+------------+-----------------+-----------+------------+
| rs_id | variant | gene_id | chromosome |
+------------+-----------------+-----------+------------+
| rs13184706 | 5:43888254:C:T | 43888254| 5 |
| rs58824264 | 5:43888493:C:T | 43888493| 5 |
+------------+-----------------+-----------+------------+
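I want to iterate over df_json and, for each row in it, select the rows of df where gene_id lies within the range (gene_start, gene_end) and df['chromosome'] == df_json['chromosome']. In addition, I need a new column in the resulting DataFrame that holds the ensembl_id from df_json.
The code below does this, but it is very slow. I need a faster way, because I have to run it on millions of rows.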
result_df = []
for row in df_json.itertuples():
    # row[0] is the index; the columns follow in order
    gene_end, gene_start = row[3], row[4]
    # between() is inclusive on both ends by default
    gene = df.loc[(df['gene_id'].between(gene_start, gene_end)) & (df['chromosome'] == row[1])].copy()
    gene['ensembl_id'] = row[2]
    result_df.append(gene)
print(result_df[0])
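Answer 0 (score: 0)
You should avoid iterating over the rows of a pandas DataFrame wherever possible, because it is inefficient and hard to read.
You can implement the logic with pd.DataFrame.merge and pd.Series.between. I have changed the data in the example to make it more interesting.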
import pandas as pd

df_json = pd.DataFrame({'chromosome': [7, 12, 17, 6, 9],
                        'ensembl_id': ['ENSG00000122543', 'ENSG00000111325', 'ENSG00000181396',
                                       'ENSG00000119900', 'ENSG00000106809'],
                        'gene_end': [5886362, 122980043, 82418637, 71308950, 92404696],
                        'gene_start': [5879827, 122974580, 82389223, 71288803, 92383967]})

df = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
                   'variant': ['5:43888254:C:T', '5:43888493:C:T'],
                   'gene_id': [5880000, 43888493],
                   'chromosome': [7, 9]})

res = df_json.merge(df, how='left', on='chromosome')
res = res[res['gene_id'].between(res['gene_start'], res['gene_end'])]
print(res)
# chromosome ensembl_id gene_end gene_start gene_id rs_id \
# 0 7 ENSG00000122543 5886362 5879827 5880000.0 rs13184706
# variant
# 0 5:43888254:C:T
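The merge-then-filter idea can also be wrapped in a small helper that returns the shape the question asks for: the rows of df plus an ensembl_id column. This is only a sketch. The function name annotate_variants and the variable names genes and variants are made up for illustration, and it uses an inner merge instead of the left merge shown above, so that no NaN rows are created and gene_id stays an integer column.

```python
import pandas as pd


def annotate_variants(genes: pd.DataFrame, variants: pd.DataFrame) -> pd.DataFrame:
    """Return the variants that fall inside a gene, tagged with that gene's ensembl_id.

    Hypothetical helper: column names follow the question (chromosome,
    gene_start, gene_end, ensembl_id, gene_id).
    """
    # Inner merge pairs every variant with every gene on the same chromosome.
    pairs = variants.merge(
        genes[['chromosome', 'ensembl_id', 'gene_start', 'gene_end']],
        on='chromosome', how='inner')
    # between() is inclusive on both ends by default.
    inside = pairs['gene_id'].between(pairs['gene_start'], pairs['gene_end'])
    # Keep the original variant columns and add ensembl_id.
    keep = list(variants.columns) + ['ensembl_id']
    return pairs.loc[inside, keep].reset_index(drop=True)


genes = pd.DataFrame({'chromosome': [7, 12, 17, 6, 9],
                      'ensembl_id': ['ENSG00000122543', 'ENSG00000111325', 'ENSG00000181396',
                                     'ENSG00000119900', 'ENSG00000106809'],
                      'gene_end': [5886362, 122980043, 82418637, 71308950, 92404696],
                      'gene_start': [5879827, 122974580, 82389223, 71288803, 92383967]})
variants = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
                         'variant': ['5:43888254:C:T', '5:43888493:C:T'],
                         'gene_id': [5880000, 43888493],
                         'chromosome': [7, 9]})

annotated = annotate_variants(genes, variants)
print(annotated)
```

Note that the merge materializes every gene/variant pair that shares a chromosome before the filter runs, so with millions of rows the intermediate frame can get large. Calling the helper on one chromosome at a time is one way to bound that.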
Answer 1 (score: 0)
For large datasets, use pyranges. It is efficient and fast:
import pyranges as pr
c = """Chromosome ensembl_id End Start
7 ENSG00000122543 5886362 5879827
12 ENSG00000111325 122980043 122974580
17 ENSG00000181396 82418637 82389223
5 MadeUp 43889000 43888253
6 ENSG00000119900 71308950 71288803
9 ENSG00000106809 92404696 92383967"""
c2 = """rs_id variant Start End Chromosome
rs13184706 5:43888254:C:T 43888254 43888256 5
rs58824264 5:43888493:C:T 43888493 43888494 5"""
gr = pr.from_string(c)
gr2 = pr.from_string(c2)
j = gr.join(gr2)
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# | Chromosome | ensembl_id | End | Start | rs_id | variant | Start_b | End_b |
# | (category) | (object) | (int32) | (int32) | (object) | (object) | (int32) | (int32) |
# |--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------|
# | 5 | MadeUp | 43889000 | 43888253 | rs13184706 | 5:43888254:C:T | 43888254 | 43888256 |
# | 5 | MadeUp | 43889000 | 43888253 | rs58824264 | 5:43888493:C:T | 43888493 | 43888494 |
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# Unstranded PyRanges object has 2 rows and 8 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df # as pandas df
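The example above builds its ranges from strings. With the question's DataFrames, the columns would first need the names pyranges expects (Chromosome, Start, End). The sketch below shows one way to prepare them with plain pandas. The helper names to_pyranges_columns and join_with_pyranges are hypothetical. It also assumes that pyranges treats intervals as half-open, [Start, End), and that pr.PyRanges accepts a DataFrame directly, which is my understanding of the 0.x releases; check both against the version you have installed.

```python
import pandas as pd

genes = pd.DataFrame({'chromosome': [7, 5],
                      'ensembl_id': ['ENSG00000122543', 'MadeUp'],
                      'gene_end': [5886362, 43889000],
                      'gene_start': [5879827, 43888253]})
variants = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
                         'variant': ['5:43888254:C:T', '5:43888493:C:T'],
                         'gene_id': [43888254, 43888493],
                         'chromosome': [5, 5]})


def to_pyranges_columns(frame, start_col, end_col=None):
    """Rename columns to Chromosome/Start/End (hypothetical helper).

    Assumes half-open intervals: an inclusive end is shifted by +1, and a
    single position p becomes the interval [p, p + 1).
    """
    out = frame.rename(columns={'chromosome': 'Chromosome', start_col: 'Start'})
    out['Chromosome'] = out['Chromosome'].astype(str)
    if end_col is None:
        out['End'] = out['Start'] + 1
    else:
        out = out.rename(columns={end_col: 'End'})
        out['End'] = out['End'] + 1
    return out


def join_with_pyranges(genes_pr, variants_pr):
    """Overlap join; pyranges is imported lazily so the prep code runs without it."""
    import pyranges as pr
    return pr.PyRanges(genes_pr).join(pr.PyRanges(variants_pr)).df


genes_pr = to_pyranges_columns(genes, 'gene_start', 'gene_end')
variants_pr = to_pyranges_columns(variants, 'gene_id')
print(genes_pr)
print(variants_pr)
```

With pyranges installed, join_with_pyranges(genes_pr, variants_pr) should then give a pandas DataFrame comparable to j.df above.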