Selecting rows based on a second DataFrame

Time: 2017-02-16 09:10:22

Tags: pandas ipython

I have two DataFrames and am looking for a way to select (and count) rows of df1 based on the rows in df2.

This is my df1:

      Chromosome  Start position  End position Reference Variant  reads  \
0       chr1       109419841     109419841         C       T      1
1       chr1       197008365     197008365         C       T      1

   variation reads  % variation                 gDNA nomencl  \
0                1          100  Chr1(GRCh37):g.109419841C>T
1                1          100  Chr1(GRCh37):g.197008365C>T

            cDNA nomencl    ...    exon transcript ID          inheritance  \
0  NM_013296.4:c.-258C>T    ...       2   NM_013296.4  Autosomal recessive
1  NM_001994.2:c.*143G>A    ...     UTR   NM_001994.2  Autosomal recessive

  test type                      Phenotype male coverage male ratio covered  \
0   Unknown  Deafness, autosomal recessief             0                  0
1   Unknown          Factor 13 deficientie             0                  0

  female coverage female ratio covered ratio M:F
0               1                    1       0.0
1               1                    1       0.0

df1 has the following columns:

Chromosome                10561 non-null object
Start position            10561 non-null int64
End position              10561 non-null int64
Reference                 10415 non-null object
Variant                   10536 non-null object
reads                     10561 non-null int64
variation reads           10561 non-null int64
% variation               10561 non-null int64
gDNA nomencl              10561 non-null object
cDNA nomencl              10446 non-null object
protein nomencl           9997 non-null object
classification            10561 non-null object
status                    10561 non-null object
gene                      10560 non-null object
Sanger sequencing list    10561 non-null object
exon                      10502 non-null object
transcript ID             10460 non-null object
inheritance               8259 non-null object
test type                 10561 non-null object
Phenotype                 10380 non-null object
male coverage             10561 non-null int64
male ratio covered        10561 non-null int64
female coverage           10561 non-null int64
female ratio covered      10561 non-null int64

This is df2:

 Chromosome  Startposition  Endposition    Bases  Meancoverage  \
0       chr1       11073785     11074022  27831.0    117.927966
1       chr1       11076901     11077064  11803.0     72.411043

   Mediancoverage  Ratiocovered>10X  Ratiocovered>20X Genename Componentnr  \
0            97.0               1.0               1.0   TARDBP           1
1            76.0               1.0               1.0   TARDBP           2

  PositionGenes          PositionGenome                       Position
0      TARDBP.1  chr1.11073785-11074022  comp.1_chr1.11073785-11074022
1      TARDBP.2  chr1.11076901-11077064  comp.2_chr1.11076901-11077064  

I want to select all rows from df1 for which some row in df2 satisfies:
  • the same value of 'Chromosome';
  • df1['Start position'] >= df2.Startposition;
  • df1['End position'] <= df2.Endposition.

If all three conditions are met by the same row of df2, I want to select the corresponding row in df1.
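To make the three conditions concrete, here is a minimal sketch of one possible way to express them in pandas: merge on 'Chromosome', then filter on the two position comparisons. The toy values below are made up for illustration; only the column names come from the frames shown above. It assumes df1 has a default, unnamed index (so reset_index() produces a column called "index").

```python
import pandas as pd

# Toy data: values are invented, column names follow the question.
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Start position": [11073800, 197008365, 11073800],
    "End position": [11073800, 197008365, 11073800],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Startposition": [11073785, 11076901],
    "Endposition": [11074022, 11077064],
})

# Pair every df1 row with every df2 region on the same chromosome.
# reset_index() keeps df1's original row labels in a column named "index".
pairs = df1.reset_index().merge(
    df2[["Chromosome", "Startposition", "Endposition"]], on="Chromosome"
)

# Keep only the pairs where the df1 interval lies inside the df2 region.
inside = pairs[
    (pairs["Start position"] >= pairs["Startposition"])
    & (pairs["End position"] <= pairs["Endposition"])
]

# Rows of df1 that fall inside at least one region, and how many there are.
selected = df1.loc[inside["index"].unique()]
n_selected = len(selected)
```

The intermediate frame holds one row per (df1 row, df2 region) pair on the same chromosome, so it can get large when a chromosome has many regions; in that case processing one chromosome at a time may be preferable.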

I have already merged the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' in order to build a lambda function, but couldn't come up with anything.

So I hope you can help me...

1 answer:

Answer 0 (score: 0)

A short update: in the end I solved this with unix bedtools -wb. If someone can come up with a Python-based solution, I would still be glad to see it.
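For the counting part, a Python-based version might look roughly like the sketch below. It is not the bedtools approach, just a plain loop over regions that counts the variants fully contained in each one. The function name count_variants_per_region, the output column n_variants and the toy values are hypothetical; the other column names are the ones from the question.

```python
import pandas as pd

def count_variants_per_region(variants, regions):
    """Hypothetical helper: count rows of `variants` contained in each region."""
    counts = []
    for region in regions.itertuples(index=False):
        # All three conditions are checked against the same region row.
        hit = (
            (variants["Chromosome"] == region.Chromosome)
            & (variants["Start position"] >= region.Startposition)
            & (variants["End position"] <= region.Endposition)
        )
        counts.append(int(hit.sum()))
    # n_variants is an illustrative column name, not one from the question.
    return regions.assign(n_variants=counts)

# Toy data: values are invented for illustration.
variants = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "Start position": [11073800, 11073900, 11077000, 11073800],
    "End position": [11073800, 11073900, 11077000, 11073800],
})
regions = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Startposition": [11073785, 11076901],
    "Endposition": [11074022, 11077064],
})

counted = count_variants_per_region(variants, regions)
```

The loop does one vectorised comparison per region, which is simple to read but scales with the number of regions times the number of variants.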