Question

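I'm stuck in a situation where I have to write multiple pandas DataFrames to a single PDF file. The function takes a DataFrame as input.

The first call writes to the PDF correctly, but every subsequent call overwrites the existing data, so the PDF ends up containing only one DataFrame.

Please find the Python function below: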

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

def fn_print_pdf(df):
    pp = PdfPages('Sample.pdf')
    total_rows, total_cols = df.shape

    rows_per_page = 30  # Number of rows per page
    rows_printed = 0
    page_number = 1
    while total_rows > 0:
        fig = plt.figure(figsize=(8.5, 11))
        plt.gca().axis('off')
        # pd.tools.plotting.table was moved to pd.plotting.table in pandas 0.20
        matplotlib_tab = pd.plotting.table(plt.gca(), df.iloc[rows_printed:rows_printed + rows_per_page],
            loc='upper center', colWidths=[0.15] * total_cols)
        # Tabular styling: get_celld() returns the cells keyed by (row, column)
        for cell in matplotlib_tab.get_celld().values():
            cell.set_height(0.024)
            cell.set_fontsize(12)
        # Header, footer and page number
        fig.text(4.25/8.5, 10.5/11., "Sample", ha='center', fontsize=12)
        fig.text(4.25/8.5, 0.5/11., 'P' + str(page_number), ha='center', fontsize=12)
        pp.savefig()
        plt.close()
        # Update variables
        rows_printed += rows_per_page
        total_rows -= rows_per_page
        page_number += 1
    pp.close()

I call this function as follows:

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
fn_print_pdf(df_a)

raw_data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
fn_print_pdf(df_b)

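The PDF file is available at SamplePDF. As you can see, only the data from the second DataFrame ends up being saved. Is there a way to append to the same Sample.pdf on the second pass while keeping the earlier data?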

Answer 1

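Your PDF is being overwritten because you create a new PDF document every time you call fn_print_pdf(). You could try keeping the PdfPages instance open between function calls and calling pp.close() only after all the plots have been written. For reference, see this answer.

Another option is to write the PDFs to separate files and merge them with pyPDF; see this answer.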
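For the merge route, here is a minimal sketch. It assumes the maintained pypdf package (the successor to pyPDF), version 3 or later, where PdfWriter has an append() method. The function name merge_pdfs and the file names in the usage note are made up for illustration.

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate the PDFs in input_paths, in order, into output_path."""
    # Imported lazily so this snippet loads even where pypdf is missing.
    # Assumption: pypdf >= 3.0, which provides PdfWriter.append().
    from pypdf import PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        writer.append(path)  # appends every page of that file
    with open(output_path, "wb") as fh:
        writer.write(fh)
```

You would write each DataFrame to its own file first (say part_a.pdf and part_b.pdf, hypothetical names) and then call merge_pdfs(['part_a.pdf', 'part_b.pdf'], 'Sample.pdf').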

Edit: here is some working code for the first approach.

Your function, modified:

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

def fn_print_pdf(df, pp):
    # pp is an already-open PdfPages instance; the caller is responsible for closing it
    total_rows, total_cols = df.shape

    rows_per_page = 30  # Number of rows per page
    rows_printed = 0
    page_number = 1
    while total_rows > 0:
        fig = plt.figure(figsize=(8.5, 11))
        plt.gca().axis('off')
        # pd.tools.plotting.table was moved to pd.plotting.table in pandas 0.20
        matplotlib_tab = pd.plotting.table(plt.gca(), df.iloc[rows_printed:rows_printed + rows_per_page],
            loc='upper center', colWidths=[0.15] * total_cols)
        # Tabular styling: get_celld() returns the cells keyed by (row, column)
        for cell in matplotlib_tab.get_celld().values():
            cell.set_height(0.024)
            cell.set_fontsize(12)
        # Header, footer and page number
        fig.text(4.25/8.5, 10.5/11., "Sample", ha='center', fontsize=12)
        fig.text(4.25/8.5, 0.5/11., 'P' + str(page_number), ha='center', fontsize=12)
        pp.savefig()
        plt.close()
        # Update variables
        rows_printed += rows_per_page
        total_rows -= rows_per_page
        page_number += 1

Call your function like this:

pp = PdfPages('Sample.pdf')
fn_print_pdf(df_a,pp)
fn_print_pdf(df_b,pp)   
pp.close()
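
A variation on the same idea, in case it helps: PdfPages also works as a context manager, so the file is closed for you even if one of the pages raises. The sketch below is not the code from the question. The name write_frames_to_pdf and its parameters are made up for illustration, and it leaves out the header, footer and cell styling to stay short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages


def write_frames_to_pdf(frames, path, rows_per_page=30):
    """Write every DataFrame in frames to one PDF; return the page count."""
    with PdfPages(path) as pp:  # the PDF is finalized when the block exits
        for df in frames:
            # One page per chunk of rows_per_page rows
            for start in range(0, len(df), rows_per_page):
                fig, ax = plt.subplots(figsize=(8.5, 11))
                ax.axis("off")
                pd.plotting.table(ax, df.iloc[start:start + rows_per_page],
                                  loc="upper center",
                                  colWidths=[0.15] * df.shape[1])
                pp.savefig(fig)
                plt.close(fig)
        return pp.get_pagecount()
```

With the DataFrames from the question this would be called as write_frames_to_pdf([df_a, df_b], 'Sample.pdf'). Since both frames have fewer than 30 rows, each one lands on its own page.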
