How to ignore some text/tables when parsing a document

Time: 2019-05-13 13:15:23

Tags: python-3.x parsing reportlab

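I am trying to parse some documents, translate them, and then create new PDF files from the translated text. I am not very experienced with the modules I am using, so I need some help. The PDF I am testing with was picked at random.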

Here is the code I am running.

from tika import parser
import tika
from googletrans import Translator

from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_JUSTIFY

# Count the number of lines in a file
def number_of_lines(filename, num=0):
    with open(filename, encoding="utf-8") as file:
        # Every newline character ends one line
        num += file.read().count('\n')
    # The text after the last newline counts as one more line
    return num + 1


# Initialize the translator
translator=Translator()

# Use Tika in client-only mode
tika.tika.TikaClientOnly=True
data=parser.from_file('http://www.comagrav.com/files/PDF/COMAGRAV%20MT%20PROFI%20DE.pdf')['content']

# Create the parsed file
parse_file='german_parse.txt'
with open(parse_file,'w',encoding="utf-8") as file:
    file.write(data)
    print()
    print("Parsed file Created!")
    print()

# Create a translated file
translated_fle='german_trans.txt'
with open(parse_file,encoding="utf-8") as file:
    with open(translated_fle,'w',encoding="utf-8") as file_d:
        data_to_trans=file.read()
        translated=translator.translate(data_to_trans,dest='en').text
        file_d.write(translated)
        print("Translated file Created!")
        print()


styles=getSampleStyleSheet()
styles.add(ParagraphStyle(name='Justify', alignment=TA_JUSTIFY))

story=[]

with open(translated_fle,encoding="utf-8") as file:
    for n in range(number_of_lines(translated_fle)):
        data_to_trans=file.readline()
        story.append(Paragraph(data_to_trans, styles["Normal"]))


doc = SimpleDocTemplate("first.pdf")
doc.build(story)
print("New PDF created")
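One thing to watch in the PDF-building loop above: reportlab's Paragraph treats its text as XML-like markup, so a stray `<` or `&` in the translated text can be misread or raise an error, and blank lines become empty paragraphs. The following is a minimal sketch of a clean-up step, assuming that behavior; `paragraph_texts` is a hypothetical helper name, not something from reportlab.

```python
from xml.sax.saxutils import escape


def paragraph_texts(lines):
    """Yield escaped, non-empty lines suitable for reportlab's Paragraph.

    Hypothetical helper for illustration; it only prepares the strings.
    """
    for line in lines:
        text = line.strip()
        if not text:
            # Skip blank lines instead of emitting empty paragraphs
            continue
        # Replace &, < and > with XML entities
        yield escape(text)
```

With this in place, the loop could probably be reduced to something like `story = [Paragraph(t, styles["Normal"]) for t in paragraph_texts(file)]`, which also removes the need to count lines first.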

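It runs fine, but what I want is to ignore the tables at the end of that document. Is there any way to do this? I only discovered this library today, and I will be practicing more with adding images, changing text, and so on. But I really cannot figure out how to ignore certain content when parsing a PDF.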
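To make the goal concrete: the plain text in `['content']` does not mark where a table starts or ends, so one possible approach is to filter the extracted text line by line before translating it. The sketch below is only a heuristic under stated assumptions: it treats a line as a table row if it splits into several columns on tabs or runs of two or more spaces, or if most of its tokens are numbers. The names `looks_like_table_row` and `drop_table_rows` and the threshold `min_columns` are hypothetical, and the rules would need tuning for a real PDF.

```python
import re

# Assumed column separator: a tab or a run of two or more whitespace characters
_COLUMN_GAP = re.compile(r"\t|\s{2,}")
# Assumed shape of a numeric cell, e.g. 12, 3.5, 1,200 or 40%
_NUMERIC = re.compile(r"[+-]?\d[\d.,]*%?")


def looks_like_table_row(line, min_columns=3):
    """Guess whether one line of extracted text is a table row (heuristic)."""
    cells = [c for c in _COLUMN_GAP.split(line.strip()) if c]
    if len(cells) >= min_columns:
        return True
    # Lines made up mostly of numbers are also treated as table data
    tokens = line.split()
    if len(tokens) >= min_columns:
        numeric = sum(1 for t in tokens if _NUMERIC.fullmatch(t))
        return numeric / len(tokens) > 0.5
    return False


def drop_table_rows(text):
    """Return the text with every line that looks like a table row removed."""
    kept = [ln for ln in text.splitlines() if not looks_like_table_row(ln)]
    return "\n".join(kept)
```

In the script above this would be applied as `data = drop_table_rows(data)` right after parsing. Because it is a guess based on layout, it can drop ordinary lines that happen to contain wide gaps or many numbers, so the output should be checked by hand.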

0 answers:

No answers yet.