Question

I have a function that takes several arguments as input and looks up the cell in a reference table that matches those arguments (using .loc). This function is part of a larger function, but when I profiled the code I realized that 99% of the time is spent on finding that cell, and I don't know whether this can be sped up.

The reference table used for the lookup has about 500,000 rows. Some columns contain strings, others contain floats.

Here is the profiled code:

.loc

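From what I have found, this is the fastest way to do it with loc, but maybe there is an alternative that does not rely on loc and would be even faster?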

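The main function (which calls this one) is applied to a DataFrame column with 8M rows, so even a small reduction in computation time per call would save a lot of time overall.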

Thanks in advance for your help!

Answer 1

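I solved this by replacing the DataFrame with a dictionary. I fused two columns (transcript_id and exon_number) into one and used it as the key to build the dictionary. I can then look up my Start value with a string built from transcript and exon.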
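As a rough illustration of the dictionary construction described above, here is a minimal sketch. The column names transcript_id, exon_number and Start come from the answer; the small `ref` table, its values and the identifiers in it are made up for the example, and the key format mirrors the `'{transcript}.e{exon}'` string used in the lookup function. It also assumes each (transcript_id, exon_number) pair occurs only once, otherwise later rows would overwrite earlier ones.

```python
import pandas as pd

# Hypothetical stand-in for the reference table (values are invented).
ref = pd.DataFrame({
    "transcript_id": ["ENST0001", "ENST0001", "ENST0002"],
    "exon_number": [1, 2, 1],
    "Start": [100, 250, 900],
})

# Fuse the two key columns into one string key, e.g. "ENST0001.e2".
keys = ref["transcript_id"].astype(str) + ".e" + ref["exon_number"].astype(str)

# Map each fused key to its Start value (plain Python ints via tolist()).
gtf = dict(zip(keys.tolist(), ref["Start"].tolist()))

# A lookup is now a single hash-table access instead of a filtered .loc call.
start = gtf["ENST0001.e2"]
```

Building the dictionary is a one-off cost; every later lookup is an average constant-time dict access, which matters when the function is called millions of times.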

The lookup function now looks like this:

def convert_position(transcript, exon, delta, genome=gtf):

    # gtf maps '<transcript_id>.e<exon_number>' keys to Start values
    start = gtf[f'{transcript}.e{exon}']

    position = start + delta

    return position

Here is the profiling output:

Timer unit: 1e-06 s

Total time: 6e-06 s
File: <ipython-input-132-e1495f24a68e>
Function: convert_position at line 41

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    41                                           def convert_position(transcript, exon, delta, genome=gtf):
    42                                               
    43         1          5.0      5.0     83.3      start = gtf[f'{transcript}.e{exon}']
    44                                               
    45         1          1.0      1.0     16.7      position = start + delta
    46                                               
    47         1          0.0      0.0      0.0      return position
