Extracting numbers that follow a character, with no spaces, from a text str pulled from a PDF

Time: 2018-04-16 20:26:12

Tags: python python-3.x

I'm new to Python and to programming. I am trying to pull a six-digit number that follows an alphabetic character out of a string like this one:

A12345612341234 asdfa we'a aslkfj4353 alsdfasA345678asA858585943

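From the string above I want to pull A123456, then loop and pull A345678 and A858585. How can I do this? I am using PyPDF2 to extract the text from a PDF and store it in a variable. I have tried slicing and lists, but I can't figure out how to make it work. I have spent some time searching online and found plenty of examples, but they don't apply to my situation, since most of them rely on whitespace. It seems like it should be really simple. This is what I have so far: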

#import PyPDF2 and set extracted text as the page_content variable
import PyPDF2
pdf_file = open('5302.pdf','rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()


#initialize the user_input variable
user_input = ""

#function to get the AFE numbers from the pdf document
def get_afenumbers(Y):

    #initialize the afe and afelist variables
    afe = "A"
    afelist = ""
    x = ""

    #Make a while loop of this after figuring out how to get only 6 digits
    #after the "A" use .isdigit() somehow?
    while True:

        if user_input.upper().startswith("Y") == True:

                #Find the letter A and extract it and its following 6 digits
                if "A" in page_content:
                    #right now only getting everything after first A
                    afe = page_content[page_content.find("A")+1:]

                    #Add AFEs to afelist
                    afelist += afe

                    #Build a string of AFEs separated by a new line character
                    x = x + '\n' + afe
                    print(afe)
                    break

                else:
                    afe = "No AFE numbers found..."

        if user_input.upper().startswith("N") == True:
            print("HAVE A GREAT DAY - GOODBYE!!!")
            break

#Build a while loop for initial question prompt (when Y or N is not True):
while user_input != "Y" and user_input != "N":
    user_input = input('List AFE numbers? Y or N: ').upper()

    if user_input not in ["Y","N"]:
        print('"',user_input,'"','is an invalid input')

get_afenumbers(user_input)

1 Answer:

Answer 0: (score: 0)

You can use a regular expression to extract the matches.

Ignoring your loop for now, we can set up the text to search like this:

text = '''A12345612341234 asdfa we'a aslkfj4353 alsdfasA345678asA858585943'''

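Now we want to match any capital letter ([A-Z]) followed by exactly six digits ([0-9]{6}). In your code it looks like you only want the letter A, so you can replace [A-Z] with just A: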

import re
re.findall(r'[A-Z][0-9]{6}', text)

This gives:

['A123456', 'A345678', 'A858585']
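As a minimal sketch of how this could fit back into the question's script, the pattern can be restricted to the letter A and wrapped in a small function. The names `AFE_PATTERN` and `find_afe_numbers` are hypothetical, and the sample string stands in for the `page_content` variable that the question fills from PyPDF2:

```python
import re

# Hypothetical name: a literal "A" followed by six digits.
AFE_PATTERN = re.compile(r"A[0-9]{6}")

def find_afe_numbers(text):
    """Return every 'A' plus six digits found in text, in order of appearance."""
    return AFE_PATTERN.findall(text)

# Stand-in for page_content; in the question this comes from the PDF page.
sample = "A12345612341234 asdfa we'a aslkfj4353 alsdfasA345678asA858585943"

afe_numbers = find_afe_numbers(sample)
if afe_numbers:
    # One AFE number per line, as the question's loop intended.
    print("\n".join(afe_numbers))
else:
    print("No AFE numbers found...")
```

Because `findall` scans the whole string and returns a list, no explicit loop or slicing is needed. An empty list means there were no matches, which covers the "No AFE numbers found..." case from the question.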