Unable to reproduce R data.table::foverlaps in Python

Asked: 2019-05-22 09:45:52

Tags: python r pandas data.table

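I have been using data.table::foverlaps in the context of overlapping genomic interval problems. I recently started looking for an equivalent of foverlaps in Python, because working in a single language would be much nicer than combining Python and R every time I have to dig into analysis output. Of course, I am not the first to ask for an equivalent of R's foverlaps in Python pandas. These are the most relevant posts I found on SO: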

2015 Merge pandas dataframes where one value is between two others

2016 R foverlaps equivalent in Python

2017 How to join two dataframes for which column values are within a certain range?

2018 How to reproduce the same output of foverlaps in R with merge of pandas in python?

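The problem is that I am not a Python expert at all, so I picked the answer that looked most similar and easiest for me to understand: the sqlite3 one.

This is how I do it in R: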

library(data.table)

interv1 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("+",10))
interv2 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("-",10))
interv  <- rbind(interv1, interv2)
interv <- data.table(interv)
colnames(interv) <- c('start', 'stop', 'color', 'strand')
interv$start <- as.integer(interv$start)
interv$stop <- as.integer(interv$stop)
interv$stop <- interv$stop -1
interv$cov <- runif(n=nrow(interv), min = 10, max = 200)

to_match <- data.table(cbind(rep(seq(from = 4, to = 43, by = 4),2), rep(c(rep("blue", 5), rep("red", 5)), 2), c(rep("-", 10), rep("+", 10))))
colnames(to_match) <- c('start', 'color', 'strand')
to_match$stop <-  to_match$start 
to_match$start <- as.integer(to_match$start)
to_match$stop <- as.integer(to_match$stop)

setkey(interv, color, strand, start, stop)
setkey(to_match, color, strand, start, stop)

overlapping_df <- foverlaps(to_match,interv)

#write.csv(x = interv, file = "Documents/script/SO/wig_foverlaps_test.txt", row.names = F)
#write.csv(x = to_match, file = "Documents/script/SO/cov_foverlaps_test.txt", row.names = F)

This is how I tried to reproduce it in Python:

import pandas as pd
import sqlite3

cov_table = pd.DataFrame(pd.read_csv('SO/cov_foverlaps_test.txt', skiprows = [0], header=None))
cov_table.columns = ['start', 'stop', 'chrm', 'strand', 'cov']
cov_table.stop = cov_table.stop - 1


wig_file = pd.DataFrame(pd.read_csv('SO/wig_foverlaps_test.txt', header=None, skiprows = [0]))
wig_file.columns = ['i_start', 'chrm', 'i_strand', 'i_stop']

cov_cols = ['start','stop','chrm','strand','cov']
fract_cols = ['i_start','i_stop','chrm','i_strand']

cov_table = cov_table.reindex(columns = cov_cols)
wig_file = wig_file.reindex(columns = fract_cols)

cov_table.start = pd.to_numeric(cov_table['start'])
cov_table.stop = pd.to_numeric(cov_table['stop'])

wig_file.i_start = pd.to_numeric(wig_file['i_start'])
wig_file.i_stop = pd.to_numeric(wig_file['i_stop'])



conn = sqlite3.connect(':memory:')

cov_table.to_sql('cov_table', conn, index=False)
wig_file.to_sql('wig', conn, index=False)

qry = '''
    select  
        start PresTermStart,
        stop PresTermEnd,
        cov RightCov,
        i_start pos,
        strand Strand
    from
        cov_table join wig on
        i_start between start and stop and 
        cov_table.strand = wig.i_strand
     '''

test = pd.read_sql_query(qry, conn)

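No matter how I change the code, I always find small differences in the output (test). In this example, two rows are missing from the Python result table. In both, the value should fall within the range and is equal to the end of the range:

Missing rows: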

> 19   24  141.306318     24      +
> 
> 19   24  122.923700     24      -

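Finally, I am worried that even if I find the right way to do this with sqlite3, the computation time will be far worse than with data.table::foverlaps.

To summarize: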

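  • My first question is, of course: where does the code go wrong?
  • Is there an approach that is better suited and closer to foverlaps in terms of computation speed?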

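Thanks for reading. I hope this post is suitable for SO.

2 Answers:

Answer 0: (score: 1)

Essentially, the merge logic and the interval logic differ between your R and Python versions.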

R

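According to the foverlaps documentation, the default type, any, which you are using, matches under the following condition:

> Let [a,b] and [c,d] be intervals in x and y with a<=b and c<=d.
> ...
> For type="any", as long as c<=b and d>=a, they overlap.

In addition, you also join on the other key columns. All together, you impose the following logic (translated to the SQLite column names for comparison):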

foverlaps(to_match, interv) --> foverlaps(cov_table, wig)

  1. wig.i_start <= cov_table.stop (i.e., c <= b)
  2. wig.i_stop >= cov_table.start (i.e., d >= a)
  3. wig.color == cov_table.color
  4. wig.strand == cov_table.strand
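
The four conditions above can be sketched in plain Python. This is a minimal illustration, not foverlaps itself: the function names and the dict-based rows are hypothetical, and the sample values are made up.

```python
def overlaps_any(a, b, c, d):
    """Return True when the closed intervals [a, b] and [c, d] share a point."""
    if a > b or c > d:
        raise ValueError("each interval needs start <= end")
    return c <= b and d >= a


def keyed_overlap(x_row, y_row):
    """Combine the interval test with equality on the other key columns.

    Rows are hypothetical dicts with keys 'start', 'stop', 'color', 'strand'.
    """
    return (
        x_row["color"] == y_row["color"]
        and x_row["strand"] == y_row["strand"]
        and overlaps_any(x_row["start"], x_row["stop"],
                         y_row["start"], y_row["stop"])
    )


# Made-up rows: a zero-width query sitting exactly on the end of an interval.
query = {"start": 9, "stop": 9, "color": "blue", "strand": "+"}
interval = {"start": 7, "stop": 9, "color": "blue", "strand": "+"}
other_strand = dict(interval, strand="-")

print(keyed_overlap(query, interval))      # True: both ends are inclusive
print(keyed_overlap(query, other_strand))  # False: strand differs
```

Both interval ends are inclusive here, which is what makes a query position equal to the interval's end count as a match.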

Python

You are running an INNER JOIN + interval query with the following logic:

  1. wig.i_start >= cov_table.start (i.e., i_start between start and stop)
  2. wig.i_start <= cov_table.stop (i.e., i_start between start and stop)
  3. wig.strand == cov_table.strand

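Notable differences in the Python version compared to R: wig.i_stop is never used; wig.i_chrm (or color) is never used; and wig.i_start is conditioned twice.

To fix this, consider the following untested SQL adjustment, which will hopefully achieve the R results. By the way, in SQL it is best practice to qualify every column in the JOIN clause (and even in SELECT) with its table name or alias: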
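
One detail may be worth checking alongside these differences. In the question, to_match sets stop equal to start, so every query interval has zero width. For such intervals the BETWEEN condition and the type="any" condition give the same answer, and a match on the boundary depends only on the stop value stored in the table. The sketch below uses the 19..24 interval and position 24 from the missing rows. Whether the `cov_table.stop - 1` line in the Python script is what shifts that boundary is an assumption here, not something confirmed in the thread.

```python
def question_predicate(i_start, start, stop):
    """The question's SQL condition: i_start BETWEEN start AND stop (inclusive)."""
    return start <= i_start <= stop


def foverlaps_any(i_start, i_stop, start, stop):
    """The type="any" condition: i_start <= stop and i_stop >= start."""
    return i_start <= stop and i_stop >= start


# For zero-width query intervals the two conditions always agree.
for pos in range(15, 30):
    assert question_predicate(pos, 19, 24) == foverlaps_any(pos, pos, 19, 24)

print(question_predicate(24, 19, 24))  # True: stop left at 24
print(question_predicate(24, 19, 23))  # False: stop reduced by one more
```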

select
   cov_table.start as PresTermStart,
   cov_table.stop as PresTermEnd,
   cov_table.cov as RightCov,
   wig.i_start as pos,
   wig.i_strand as Strand               -- wig has i_strand, not strand
from
   cov_table
join wig
    on cov_table.chrm = wig.chrm        -- the color column is named chrm in both tables
   and cov_table.strand = wig.i_strand
   and wig.i_start <= cov_table.stop    -- c <= b
   and wig.i_stop  >= cov_table.start   -- d >= a
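
To try this kind of join without the CSV files, here is a small self-contained sketch using only the standard-library sqlite3 module. The table and column names follow the DataFrame definitions in the question (`chrm`, `i_strand`); the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table cov_table (start integer, stop integer, chrm text, strand text, cov real);
    create table wig (i_start integer, i_stop integer, chrm text, i_strand text);
    """
)

# Made-up rows for illustration only.
conn.executemany(
    "insert into cov_table values (?, ?, ?, ?, ?)",
    [(19, 24, "red", "+", 1.5), (19, 24, "red", "-", 2.5), (27, 34, "red", "+", 3.5)],
)
conn.executemany(
    "insert into wig values (?, ?, ?, ?)",
    [(24, 24, "red", "+"), (24, 24, "red", "-"), (26, 26, "red", "+")],
)

query = """
    select
        cov_table.start as PresTermStart,
        cov_table.stop  as PresTermEnd,
        cov_table.cov   as RightCov,
        wig.i_start     as pos,
        wig.i_strand    as Strand
    from cov_table
    join wig
        on cov_table.chrm = wig.chrm
       and cov_table.strand = wig.i_strand
       and wig.i_start <= cov_table.stop
       and wig.i_stop  >= cov_table.start
    order by wig.i_strand, wig.i_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (19, 24, 1.5, 24, '+')
# (19, 24, 2.5, 24, '-')
conn.close()
```

Position 26 falls in the gap between the two '+' intervals, so it produces no row, while position 24 matches the end of 19..24 on each strand.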

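For better performance, consider using a persistent (not in-memory) SQLite database and creating indexes on the join fields: color, strand, start, and stop.

Answer 1: (score: 1)

To do interval overlaps in Python, simply use pyranges: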
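
A minimal sketch of that setup, assuming the column names from the question's tables. The database path and the index names are hypothetical, and whether the query planner actually uses these indexes for the range conditions depends on the data, so it is worth checking with EXPLAIN QUERY PLAN.

```python
import os
import sqlite3
import tempfile

# Hypothetical on-disk location; any writable path would do.
db_path = os.path.join(tempfile.mkdtemp(), "overlaps.db")
conn = sqlite3.connect(db_path)

conn.executescript(
    """
    create table cov_table (start integer, stop integer, chrm text, strand text, cov real);
    create table wig (i_start integer, i_stop integer, chrm text, i_strand text);

    -- Equality columns first, then the range columns.
    create index idx_cov_key on cov_table (chrm, strand, start, stop);
    create index idx_wig_key on wig (chrm, i_strand, i_start, i_stop);
    """
)
conn.commit()

# List the indexes that now exist in the database file.
index_names = [
    row[0]
    for row in conn.execute(
        "select name from sqlite_master where type = 'index' order by name"
    )
]
print(index_names)  # ['idx_cov_key', 'idx_wig_key']
conn.close()
```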

import pyranges as pr

c1 = """Chromosome Start End Gene
1 10 20 blo
1 45 46 bla"""

c2 = """Chromosome Start End Gene
1 10 35 bip
1 25 50 P53
1 40 10000 boop"""


gr1, gr2 = pr.from_string(c1), pr.from_string(c2)

j = gr1.join(gr2)
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# |   Chromosome |     Start |       End | Gene       |   Start_b |     End_b | Gene_b     |
# |   (category) |   (int32) |   (int32) | (object)   |   (int32) |   (int32) | (object)   |
# |--------------+-----------+-----------+------------+-----------+-----------+------------|
# |            1 |        10 |        20 | blo        |        10 |        35 | bip        |
# |            1 |        45 |        46 | bla        |        25 |        50 | P53        |
# |            1 |        45 |        46 | bla        |        40 |     10000 | boop       |
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
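
One thing to keep in mind when moving foverlaps data into a tool like this is the coordinate convention. foverlaps treats both ends as inclusive, whereas the sketch below assumes the target tool treats End as exclusive (the usual BED-style convention, and to my understanding what pyranges uses, but verify against its documentation). The helper names are hypothetical and the code is plain Python, not the pyranges API.

```python
def closed_to_half_open(start, stop):
    """Convert an end-inclusive interval [start, stop] to end-exclusive [start, stop + 1)."""
    return start, stop + 1


def half_open_overlap(s1, e1, s2, e2):
    """Overlap test for end-exclusive intervals [s1, e1) and [s2, e2)."""
    return s1 < e2 and s2 < e1


interval = closed_to_half_open(19, 24)  # (19, 25)
point = closed_to_half_open(24, 24)     # (24, 25)

print(half_open_overlap(*point, *interval))  # True after conversion
print(half_open_overlap(24, 24, 19, 24))     # False if closed coordinates are passed unchanged
```

Under that assumption, a zero-width interval such as 24..24 would need its end shifted by one before the join, otherwise boundary matches could be lost.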