Question

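As part of a large QC benchmark, I am creating a large number (about 100K) of scatter plots in a single PDF using the PdfPages backend. (See the code below.)

My problem is that plotting takes too much time. Here is the output of my custom profiling/debugging code: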

Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
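
For context, checkpoint lines like the ones above could come from a small timing helper such as the following. This is only a sketch: the `checkpoint` name and its `log` parameter are hypothetical and not part of the benchmark code shown in this question.

```python
import time
from contextlib import contextmanager


@contextmanager
def checkpoint(label, log=print):
    """Time the enclosed block and report the elapsed wall time in milliseconds.

    Hypothetical helper: the name and the message format are assumptions
    modelled on the log lines above.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        log(f"{label} in {elapsed_ms} millis")
```

It would be used as `with checkpoint("Checkpoint3: plotting and accumulating done"): ...` around each stage of a cycle.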

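If I annotate the points, which the use case requires, the plotting time goes up by a factor of 2-3. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not change my timings significantly.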

import matplotlib.pyplot as plt


def annotate(row, ax):
    """ Label one point with its index value (row.name under df.apply) """
    ax.annotate(row.name, (row.exp, row.model),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')


def plot2File(df, file, seq, z, p, s):
    """ Plot predictions vs experimental """
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)
#     for row in df.itertuples():
#         ax.annotate(row.Index, (row.exp, row.model),
#                     xytext=(10, 20), textcoords='offset points',
#                     arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
#                     family='sans-serif', fontsize=8, color='darkslategrey')

    plt.savefig(file, bbox_inches='tight', format='pdf')
    plt.close()

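Given the nice explanation by Jeff on the question about iterrows(), I was wondering whether the annotation process can be vectorized. Or should I drop the DataFrame altogether?

Is it possible to vectorize annotations in matplotlib?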
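
To make the "drop the DataFrame" option concrete, here is a minimal sketch of that variant. As far as I can tell, each `ax.annotate` call creates a single annotation, so the loop itself stays. The sketch only iterates over plain NumPy arrays instead of pandas rows and reuses one figure across pages. The names `plot_to_pdf` and `ARROW` are hypothetical, and I have not measured whether this makes a difference for this workload.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumed suitable for batch PDF output
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

# Same connector style as in the question, built once instead of per call.
ARROW = dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10")


def plot_to_pdf(df, pdf, fig, ax, title):
    """Draw one annotated scatter plot on a reused Axes and append it to `pdf`."""
    ax.clear()  # reuse the same figure/axes instead of creating new ones
    x = df["exp"].to_numpy()
    y = df["model"].to_numpy()
    ax.scatter(x, y, s=40)
    ax.set_xlabel("exp")
    ax.set_ylabel("model")
    ax.set_title(title)
    # Still one annotate() call per point, but over plain arrays,
    # so no pandas Series or namedtuple is built for each row.
    for label, xi, yi in zip(df.index, x, y):
        ax.annotate(str(label), (xi, yi),
                    xytext=(10, 20), textcoords="offset points",
                    arrowprops=ARROW,
                    family="sans-serif", fontsize=8, color="darkslategrey")
    pdf.savefig(fig, bbox_inches="tight")
```

A caller would create the figure once with `fig, ax = plt.subplots()`, open `PdfPages(...)` in a `with` block, call `plot_to_pdf` once per cycle, and close the figure at the end.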

0 answers: