Splitting document sections into a list for SQL export in Python

Time: 2017-04-13 21:17:24

Tags: sql python-3.x parsing document

I am new to Python, and I am trying to break some legal documents into sections for export into SQL. I need to do two things:

  1. Define the section numbers by the table of contents, and
  2. Break the document apart according to those defined section numbers.

    The table of contents lists the section numbers: 1.1, 1.2, 1.3, etc.

    Then the document itself is broken apart by those section numbers: 1.1 "...text...", 1.2 "...text...", 1.3 "...text...", etc.

    Similar to the chapters of a book, but delimited by incrementing decimal numbers.
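The kind of split described above can be sketched with `re.split` and a capturing group, which keeps the section numbers alongside their text instead of discarding them (the section numbers and text here are made up for illustration):

```python
import re

text = "1.1 First clause text. 1.2 Second clause text. 1.3 Third clause text."

# A capturing group in re.split keeps the delimiters (the section numbers)
# in the result list instead of dropping them.
parts = re.split(r"(\d+\.\d+)", text)

# parts[0] is whatever precedes the first header (empty here); after that,
# headers and bodies alternate, so pair them up.
sections = list(zip(parts[1::2], [p.strip() for p in parts[2::2]]))
print(sections)
# [('1.1', 'First clause text.'), ('1.2', 'Second clause text.'), ('1.3', 'Third clause text.')]
```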

    I have parsed the document with Tika, and I have been able to create a list of sections with some basic regex:

    import tika
    import re
    
    from tika import parser
    parsed = parser.from_file('test.pdf')
    content = (parsed["content"])
    
    headers = re.findall("[0-9]*[.][0-9]",content)
    

    Now I need to do something like this:

    splitsections = content.split() by headers
    
    var_string = ', '.join('?' * len(splitsections))
    query_string = 'INSERT INTO table VALUES (%s);' % var_string
    cursor.execute(query_string, splitsections)
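One caveat with the sketch above: `len(splitsections)` would generate one `?` placeholder per list element, but if each section is a single text value, a parameterized insert normally takes one placeholder per column, with one row per section. A self-contained `sqlite3` sketch under that assumption (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE sections (header TEXT, body TEXT)")

# Each section becomes one row: (header, body).
splitsections = [("1.1", "first section text"), ("1.2", "second section text")]
cursor.executemany("INSERT INTO sections VALUES (?, ?)", splitsections)
conn.commit()

rows = cursor.execute("SELECT header FROM sections ORDER BY header").fetchall()
print(rows)  # [('1.1',), ('1.2',)]
```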
    

    Apologies if any of this is unclear. Still very new to this.

    Many thanks for any help you can provide.

    "&检验.pdf#34;将是这样的文件:

    http://nakedcapitalism.net/LPAs/verified-as-LPAs/Apollo_Investment_Fund_VIII_LPA_S1.pdf

    The table of contents is on pages i through iii (which is where you get the section numbers). Then the text I want to split starts on page 20 (with section 2.1).

1 Answer:

Answer 0: (score: 0)

Everything is tested except the final DB section. The code can also be improved, but that is another task. The main task is done.

In the list split_content you get all the information (i.e. the text between 2.1 and 2.2, then between 2.2 and 2.3, and so on), excluding the section number + name itself (i.e. excluding 2.1 Continuation, 2.2 Name, and so on).

I replaced tika with PyPDF2, because tika did not provide the tools needed for this task (i.e. I did not find a way to give it the page numbers I need and get their content).

    import re

    import PyPDF2


    def get_pdf_content(pdf_path,
                        start_page_table_contents, end_page_table_contents,
                        first_parsing_page, last_phrase_to_stop):
        """
        :param pdf_path: Full path to the PDF file
        :param start_page_table_contents: The page where the "Contents table" starts
        :param end_page_table_contents: The page where the "Contents Table" ends
               (i.e. the number of the page where the Contents Table ENDs, not the next one)
        :param first_parsing_page: The 1st page where we need to start data grabbing
        :param last_phrase_to_stop: The phrase that tells the code where to stop grabbing.
               The phrase must match exactly what is written in the PDF.
               This phrase will be excluded from the grabbed data.
        :return:
        """
        # ======== GRAB TABLE OF CONTENTS ========
        start_page = start_page_table_contents
        end_page = end_page_table_contents
        table_of_contents_page_nums = range(start_page - 1, end_page)

        sections_of_articles = []  # ['2.1 Continuation', '2.2 Name', ... ]
        open_file = open(pdf_path, "rb")
        pdf = PyPDF2.PdfFileReader(open_file)

        for page_num in table_of_contents_page_nums:
            page_content = pdf.getPage(page_num).extractText()
            page_sections = re.findall("[\d]+[.][\d][™\s\w;,-]+", page_content)
            for section in page_sections:
                cleared_section = section.replace('\n', '').strip()
                sections_of_articles.append(cleared_section)

        # ======== GRAB ALL NECESSARY CONTENT (MERGE ALL PAGES) ========
        total_num_pages = pdf.getNumPages()
        parsing_pages = range(first_parsing_page - 1, total_num_pages)

        full_parsing_content = ''  # Merged pages

        for parsing_page in parsing_pages:
            page_content = pdf.getPage(parsing_page).extractText()
            cleared_page = page_content.replace('\n', '')

            # Remove the page num from the start of "page_content".
            # Covers the case with pages 65, 71 and others, when "page_content" starts
            # with, for example, "616.6 Liability to Partners. (a) It is understood that",
            # i.e. "61" is the page num and "6.6 Liability ..." is the section data.
            already_cleared = False
            first_50_chars = cleared_page[:51]
            for section in sections_of_articles:
                if section in first_50_chars:
                    indx = cleared_page.index(section)
                    cleared_page = cleared_page[indx:]
                    already_cleared = True
                    break

            # Covers all other cases
            if not already_cleared:
                page_num_to_remove = re.match(r'^\d+', cleared_page)
                if page_num_to_remove:
                    cleared_page = cleared_page[len(str(page_num_to_remove.group(0))):]

            full_parsing_content += cleared_page

        # ======== BREAK ALL CONTENT INTO PIECES ACCORDING TO TABLE OF CONTENTS ========
        split_content = []
        num_sections = len(sections_of_articles)

        for num_section in range(num_sections):
            start = sections_of_articles[num_section]
            # Get the last piece, i.e. "11.16 FATCA" (as there is no "end" section after
            # "11.16 FATCA", we can't use the logic "grab info between sections 11.1 and 11.2,
            # 11.2 and 11.3, and so on")
            if num_section == num_sections - 1:
                end = last_phrase_to_stop
            else:
                end = sections_of_articles[num_section + 1]

            content = re.search('%s(.*)%s' % (start, end), full_parsing_content).group(1)
            cleared_piece = content.replace('™', "'").strip()
            if cleared_piece[0:3] == '. ':
                cleared_piece = cleared_piece[3:]

            # There are a few appearances of "[Signature Page Follows]" as a
            # "last_phrase_to_stop". We need the text between "11.16 FATCA" and the
            # 1st appearance of "[Signature Page Follows]".
            try:
                indx = cleared_piece.index(end)
                cleared_piece = cleared_piece[:indx]
            except ValueError:
                pass

            split_content.append(cleared_piece)

        # ======== INSERT INTO DB ========
        # Did not test this section
        for piece in split_content:
            var_string = ', '.join('?' * len(piece))
            query_string = 'INSERT INTO table VALUES (%s);' % var_string
            cursor.execute(query_string, piece)

How to use it (one of the possible ways):

1) Save the code above in my_pdf_code.py

2) In a python shell, import get_pdf_content from my_pdf_code and call it with your PDF's path, the table-of-contents page range, the first page to parse, and the stop phrase.
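The core trick in this answer, grabbing the text between two consecutive headers with `re.search('%s(.*)%s' % (start, end), ...)`, can be seen on a toy string (the headers and text are invented; note that if a header ever contains regex metacharacters, wrapping it in `re.escape` is safer than interpolating it raw):

```python
import re

full_parsing_content = ("2.1 Continuation The partnership continues as before. "
                        "2.2 Name The name is unchanged.")

start, end = "2.1 Continuation", "2.2 Name"
# re.escape guards against regex metacharacters inside the header text.
match = re.search("%s(.*)%s" % (re.escape(start), re.escape(end)), full_parsing_content)
piece = match.group(1).strip()
print(piece)  # The partnership continues as before.
```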