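I just want to parse some documents and then create new PDF files after translating them. I'm not very proficient with the modules I'm using, so I need some help. I just picked a PDF at random.
Below is the code I'm running.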
from tika import parser
import tika
from googletrans import Translator
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_JUSTIFY
# Count the number of lines in a file
def number_of_lines(filename, num=0):
    with open(filename, encoding="utf-8") as file:
        user_resp = file.read()
        for x in user_resp:
            if x == '\n':
                num += 1
    return num + 1
# Initiate the Translator
translator = Translator()
# Use Tika in client-only mode
tika.tika.TikaClientOnly = True
data = parser.from_file('http://www.comagrav.com/files/PDF/COMAGRAV%20MT%20PROFI%20DE.pdf')['content']
# Create the parsed file
parse_file = 'german_parse.txt'
with open(parse_file, 'w', encoding="utf-8") as file:
    file.write(data)
print()
print("Parsed file Created!")
print()
# Create a translated file
translated_fle = 'german_trans.txt'
with open(parse_file, encoding="utf-8") as file:
    with open(translated_fle, 'w', encoding="utf-8") as file_d:
        data_to_trans = file.read()
        translatteedd = translator.translate(data_to_trans, dest='en').text
        file_d.write(translatteedd)
print("Translated file Created!")
print()
# Build the PDF story, one paragraph per line of translated text
styles = getSampleStyleSheet()
styles.add(ParagraphStyle(name='Justify', alignment=TA_JUSTIFY))
story = []
with open(translated_fle, encoding="utf-8") as file:
    for n in range(number_of_lines(translated_fle)):
        data_to_trans = file.readline()
        story.append(Paragraph(data_to_trans, styles["Normal"]))
doc = SimpleDocTemplate("first.pdf")
doc.build(story)
print("New PDF created")
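It works fine, but what I want to do is ignore the tables at the end of that document. Is there any way to do this? I only discovered this library today and will practice more to add images, change text, and so on. But I really can't figure out how to ignore certain content when parsing a PDF.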
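To make the question more concrete, here is a rough sketch of the kind of filtering I have in mind. As far as I can tell, the plain text in `['content']` does not mark where a table starts or ends, so this only guesses from the shape of each line. The helpers `looks_like_table_row` and `drop_table_like_lines` are hypothetical names of my own, not part of tika, googletrans, or reportlab. The thresholds and the sample text are made up for illustration.

```python
import re

# Hypothetical helpers: a heuristic only, not part of any library used above.
def looks_like_table_row(line, min_cells=3, numeric_ratio=0.5):
    """Guess whether a line of extracted text is a table row."""
    # Assumption: table cells come out separated by tabs or runs of spaces.
    cells = [c for c in re.split(r"\t|\s{2,}", line.strip()) if c]
    if len(cells) >= min_cells:
        return True
    # Assumption: rows of measurements consist mostly of tokens with digits.
    tokens = line.split()
    if len(tokens) >= min_cells:
        numeric = sum(1 for t in tokens if re.search(r"\d", t))
        return numeric / len(tokens) >= numeric_ratio
    return False

def drop_table_like_lines(text):
    """Return text with the lines that look like table rows removed."""
    kept = [ln for ln in text.splitlines() if not looks_like_table_row(ln)]
    return "\n".join(kept)

# Invented sample text, not taken from the actual PDF.
sample = (
    "Das Geraet ist robust gebaut.\n"
    "Typ\tBreite\tGewicht\n"
    "MT 3  120  450\n"
    "Einfach zu bedienen."
)
print(drop_table_like_lines(sample))
```

In my script this would presumably be applied as `data = drop_table_like_lines(data)` before writing `data` to `parse_file`. The obvious weakness is that an ordinary sentence containing many numbers would be dropped as well, so I would prefer a way to skip tables during parsing if one exists.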