I have the following code snippet:
import pdfplumber, requests
from io import BytesIO
import pandas as pd

def get_title_liked_txt(page: object):
    df = pd.DataFrame(page.chars)
    # keep only the two largest font sizes on the page (title-like text)
    title_liked_fontsizes = df['size'].value_counts().sort_index(ascending=False).index[:2]
    df = df[df['size'].isin(title_liked_fontsizes)]
    # join the characters that share exactly the same top/bottom coordinates
    title_like_txt_df = df.groupby(['top', 'bottom'])['text'].apply(''.join).reset_index()
    print(title_like_txt_df)

url = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0417/2020041700700.pdf'
response = requests.get(url)
stream = BytesIO(response.content)
plumber_pdf = pdfplumber.open(stream)
page = plumber_pdf.pages[111]
get_title_liked_txt(page)
It produces:
top bottom text
0 59.735 77.735 ’
1 59.879 77.879 INDEPENDENT AUDITORS REPORT
2 311.317 322.317 Opinion
3 554.151 565.151 Basis for opinion
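I want to allow some tolerance when grouping by top and bottom. When grouping, if the difference between the current row and the previous row is less than 0.5, the two should be treated as the same value, so that row 0 in the result above is merged into the line it belongs to.
This is the expected result: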
top bottom text
0 59.879 77.879 INDEPENDENT AUDITOR’S REPORT
1 311.317 322.317 Opinion
2 554.151 565.151 Basis for opinion
I found something like this:
cond = df['top'].diff().abs() < 0.5
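But I am not sure how to replace the previous value when this condition is met. Any suggestions would be appreciated.
Edit: additional information
This is the dataframe before grouping: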
fontname adv upright x0 y0 x1 y1 width height size object_type page_number stroking_color non_stroking_color text top bottom doctop
94 MBPGXA+TrajanPro-Bold 0.452 1 25.512 729.995 33.648 747.995 8.136 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) I 59.879 77.879 89733.893
95 MBPGXA+TrajanPro-Bold 0.947 1 33.198 729.995 50.244 747.995 17.046 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) N 59.879 77.879 89733.893
96 MBPGXA+TrajanPro-Bold 0.936 1 49.794 729.995 66.642 747.995 16.848 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) D 59.879 77.879 89733.893
97 MBPGXA+TrajanPro-Bold 0.632 1 66.192 729.995 77.568 747.995 11.376 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) E 59.879 77.879 89733.893
98 MBPGXA+TrajanPro-Bold 0.655 1 77.118 729.995 88.908 747.995 11.790 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) P 59.879 77.879 89733.893
99 MBPGXA+TrajanPro-Bold 0.632 1 88.458 729.995 99.834 747.995 11.376 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) E 59.879 77.879 89733.893
100 MBPGXA+TrajanPro-Bold 0.947 1 99.384 729.995 116.430 747.995 17.046 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) N 59.879 77.879 89733.893
101 MBPGXA+TrajanPro-Bold 0.936 1 115.980 729.995 132.828 747.995 16.848 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) D 59.879 77.879 89733.893
102 MBPGXA+TrajanPro-Bold 0.632 1 132.378 729.995 143.754 747.995 11.376 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) E 59.879 77.879 89733.893
103 MBPGXA+TrajanPro-Bold 0.947 1 143.304 729.995 160.350 747.995 17.046 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) N 59.879 77.879 89733.893
104 MBPGXA+TrajanPro-Bold 0.710 1 159.900 729.995 172.680 747.995 12.780 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) T 59.879 77.879 89733.893
105 MBPGXA+TrajanPro-Bold 0.300 1 172.230 729.995 177.630 747.995 5.400 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) 59.879 77.879 89733.893
106 MBPGXA+TrajanPro-Bold 0.700 1 177.180 729.995 189.780 747.995 12.600 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) A 59.879 77.879 89733.893
107 MBPGXA+TrajanPro-Bold 0.852 1 189.330 729.995 204.666 747.995 15.336 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) U 59.879 77.879 89733.893
108 MBPGXA+TrajanPro-Bold 0.936 1 204.216 729.995 221.064 747.995 16.848 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) D 59.879 77.879 89733.893
109 MBPGXA+TrajanPro-Bold 0.452 1 220.614 729.995 228.750 747.995 8.136 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) I 59.879 77.879 89733.893
110 MBPGXA+TrajanPro-Bold 0.710 1 228.300 729.995 241.080 747.995 12.780 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) T 59.879 77.879 89733.893
111 MBPGXA+TrajanPro-Bold 0.927 1 240.630 729.995 257.316 747.995 16.686 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) O 59.879 77.879 89733.893
112 MBPGXA+TrajanPro-Bold 0.755 1 256.866 729.995 270.456 747.995 13.590 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) R 59.879 77.879 89733.893
113 MBPGXA+TrajanPro-Bold 0.218 1 270.006 730.139 273.930 748.139 3.924 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) ’ 59.735 77.735 89733.749
114 MBPGXA+TrajanPro-Bold 0.582 1 273.480 729.995 283.956 747.995 10.476 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) S 59.879 77.879 89733.893
115 MBPGXA+TrajanPro-Bold 0.300 1 283.506 729.995 288.906 747.995 5.400 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) 59.879 77.879 89733.893
116 MBPGXA+TrajanPro-Bold 0.755 1 288.456 729.995 302.046 747.995 13.590 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) R 59.879 77.879 89733.893
117 MBPGXA+TrajanPro-Bold 0.632 1 301.596 729.995 312.972 747.995 11.376 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) E 59.879 77.879 89733.893
118 MBPGXA+TrajanPro-Bold 0.655 1 312.522 729.995 324.312 747.995 11.790 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) P 59.879 77.879 89733.893
119 MBPGXA+TrajanPro-Bold 0.927 1 323.862 729.995 340.548 747.995 16.686 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) O 59.879 77.879 89733.893
120 MBPGXA+TrajanPro-Bold 0.755 1 340.098 729.995 353.688 747.995 13.590 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) R 59.879 77.879 89733.893
121 MBPGXA+TrajanPro-Bold 0.710 1 353.238 729.995 366.018 747.995 12.780 18.000 18.000 char 112 (0, 0, 0, 0.6) (0, 0, 0, 1) T 59.879 77.879 89733.893
416 MBPGXA+TrajanPro-Bold 0.927 1 56.693 485.557 66.890 496.557 10.197 11.000 11.000 char 112 None [1] O 311.317 322.317 89985.331
417 MBPGXA+TrajanPro-Bold 0.596 1 67.220 485.557 73.776 496.557 6.556 11.000 11.000 char 112 None [1] p 311.317 322.317 89985.331
418 MBPGXA+TrajanPro-Bold 0.407 1 74.106 485.557 78.583 496.557 4.477 11.000 11.000 char 112 None [1] i 311.317 322.317 89985.331
419 MBPGXA+TrajanPro-Bold 0.841 1 78.913 485.557 88.164 496.557 9.251 11.000 11.000 char 112 None [1] n 311.317 322.317 89985.331
420 MBPGXA+TrajanPro-Bold 0.407 1 88.494 485.557 92.971 496.557 4.477 11.000 11.000 char 112 None [1] i 311.317 322.317 89985.331
421 MBPGXA+TrajanPro-Bold 0.827 1 93.301 485.557 102.398 496.557 9.097 11.000 11.000 char 112 None [1] o 311.317 322.317 89985.331
422 MBPGXA+TrajanPro-Bold 0.841 1 102.728 485.557 111.979 496.557 9.251 11.000 11.000 char 112 None [1] n 311.317 322.317 89985.331
2200 MBPGXA+TrajanPro-Bold 0.707 1 56.693 242.723 64.470 253.723 7.777 11.000 11.000 char 112 None [1] B 554.151 565.151 90228.165
2201 MBPGXA+TrajanPro-Bold 0.632 1 64.800 242.723 71.752 253.723 6.952 11.000 11.000 char 112 None [1] a 554.151 565.151 90228.165
2202 MBPGXA+TrajanPro-Bold 0.540 1 72.082 242.723 78.022 253.723 5.940 11.000 11.000 char 112 None [1] s 554.151 565.151 90228.165
2203 MBPGXA+TrajanPro-Bold 0.407 1 78.352 242.723 82.829 253.723 4.477 11.000 11.000 char 112 None [1] i 554.151 565.151 90228.165
2204 MBPGXA+TrajanPro-Bold 0.540 1 83.159 242.723 89.099 253.723 5.940 11.000 11.000 char 112 None [1] s 554.151 565.151 90228.165
2205 MBPGXA+TrajanPro-Bold 0.300 1 89.429 242.723 92.729 253.723 3.300 11.000 11.000 char 112 None [1] 554.151 565.151 90228.165
2206 MBPGXA+TrajanPro-Bold 0.567 1 93.389 242.723 99.626 253.723 6.237 11.000 11.000 char 112 None [1] f 554.151 565.151 90228.165
2207 MBPGXA+TrajanPro-Bold 0.827 1 99.956 242.723 109.053 253.723 9.097 11.000 11.000 char 112 None [1] o 554.151 565.151 90228.165
2208 MBPGXA+TrajanPro-Bold 0.686 1 109.383 242.723 116.929 253.723 7.546 11.000 11.000 char 112 None [1] r 554.151 565.151 90228.165
2209 MBPGXA+TrajanPro-Bold 0.300 1 117.259 242.723 120.559 253.723 3.300 11.000 11.000 char 112 None [1] 554.151 565.151 90228.165
2210 MBPGXA+TrajanPro-Bold 0.827 1 121.219 242.723 130.316 253.723 9.097 11.000 11.000 char 112 None [1] o 554.151 565.151 90228.165
2211 MBPGXA+TrajanPro-Bold 0.596 1 130.646 242.723 137.202 253.723 6.556 11.000 11.000 char 112 None [1] p 554.151 565.151 90228.165
2212 MBPGXA+TrajanPro-Bold 0.407 1 137.532 242.723 142.009 253.723 4.477 11.000 11.000 char 112 None [1] i 554.151 565.151 90228.165
2213 MBPGXA+TrajanPro-Bold 0.841 1 142.339 242.723 151.590 253.723 9.251 11.000 11.000 char 112 None [1] n 554.151 565.151 90228.165
2214 MBPGXA+TrajanPro-Bold 0.407 1 151.920 242.723 156.397 253.723 4.477 11.000 11.000 char 112 None [1] i 554.151 565.151 90228.165
2215 MBPGXA+TrajanPro-Bold 0.827 1 156.727 242.723 165.824 253.723 9.097 11.000 11.000 char 112 None [1] o 554.151 565.151 90228.165
2216 MBPGXA+TrajanPro-Bold 0.841 1 166.154 242.723 175.405 253.723 9.251 11.000 11.000 char 112 None [1] n 554.151 565.151 90228.165
Answer 0 (score: 0)
Try:
def get_title_liked_txt(page: object):
    df = pd.DataFrame(page.chars)
    title_liked_fontsizes = df['size'].value_counts().sort_index(ascending=False).index[:2]
    # .copy() avoids a SettingWithCopyWarning when the 'cat' column is added below
    df = df[df['size'].isin(title_liked_fontsizes)].copy()
    # start a new group whenever 'top' moves by more than 0.5 from the previous char
    # (.abs() also catches jumps to a smaller 'top')
    df['cat'] = df.top.diff().abs().gt(0.5).cumsum() + 1
    df_temp = df.groupby(['cat'])['text'].apply(''.join).reset_index()
    # attach the first top/bottom seen in each group
    df_temp = df_temp.merge(df.groupby('cat')['top'].first().reset_index(), on='cat')
    df_temp = df_temp.merge(df.groupby('cat')['bottom'].first().reset_index(), on='cat')
    return df_temp[['top', 'bottom', 'text']]

get_title_liked_txt(page)
top bottom text
0 59.879 77.879 INDEPENDENT AUDITOR'S REPORT
1 311.317 322.317 Opinion
2 554.151 565.151 Basis for opinion
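The same diff/cumsum idea can be tried without downloading the PDF. The following is only a minimal sketch: the toy frame stands in for page.chars (its values are modelled on the question's data), and the names group_lines, tol and line_id are made up for illustration. It uses a single named aggregation instead of the two merges.

```python
import pandas as pd

# Toy stand-in for page.chars; values modelled on the question's data (assumption).
chars = pd.DataFrame({
    'text':   ['R', '’', 'S', 'O', 'p'],
    'top':    [59.879, 59.735, 59.879, 311.317, 311.317],
    'bottom': [77.879, 77.735, 77.879, 322.317, 322.317],
})

def group_lines(df, tol=0.5):
    """Group consecutive chars whose 'top' differs by at most tol (hypothetical helper)."""
    # A new line id starts wherever 'top' moves by more than tol from the previous row.
    line_id = df['top'].diff().abs().gt(tol).cumsum().rename('line_id')
    return (
        df.groupby(line_id)
          .agg(top=('top', 'first'), bottom=('bottom', 'first'), text=('text', ''.join))
          .reset_index(drop=True)
    )

result = group_lines(chars)
print(result)
```

Because each row is compared only with the row before it, a long run of small steps is merged into one group even if its first and last rows end up more than tol apart. That matches the rule stated in the question, but it is worth keeping in mind for pages with slanted or drifting baselines.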
Answer 1 (score: 0)
You can create two new columns with rounded values, then group by the rounded values and show the last value.
df['top_r'] = df['top'].round()
df['bottom_r'] = df['bottom'].round()
df.groupby(['top_r','bottom_r']).last()
top_r bottom_r top bottom text
60.0 78.0 59.879 77.879 INDEPENDENT AUDITOR’S REPORT
311.0 322.0 311.317 322.317 Opinion
554.0 565.0 554.151 565.151 Basis for opinion
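One caveat with rounding: it buckets values around fixed boundaries, so two values that are very close can still land in different groups when they sit on either side of .5. A small sketch with made-up numbers (not taken from the PDF) that compares rounding with the difference-based rule from the question:

```python
import pandas as pd

# Made-up 'top' values: all within 0.5 of their neighbour, but straddling 59.5.
tops = pd.Series([59.45, 59.55, 59.879])

# Rounding splits the first value from the other two.
rounded = tops.round()
print(rounded.tolist())   # [59.0, 60.0, 60.0]

# Comparing each value with the previous one keeps all three together.
by_diff = tops.diff().abs().gt(0.5).cumsum()
print(by_diff.tolist())   # [0, 0, 0]
```

For the sample page in the question the rounded groups happen to coincide with the expected lines, so this only matters for coordinates that fall near a rounding boundary.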