Question

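I'm just learning Python, and for work I go through a lot of PDFs, so I found the PDFMiner tool, which converts a directory of PDFs to text files. I then wrote the code below, which tells me whether a PDF is an approved claim or a denied claim. What I can't figure out is how to find the string that starts with "Tracking Identification Number..." AND the 18 characters after it, and put that into an array.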

Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT.  WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------

Any help would be greatly appreciated. The sample above is what the text file looks like.

import csv
import glob
import os

arrayApproved = []
arrayDenied = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print('current file is: ' + infile)
        check(infile)

def check(filename):
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
        if 'DELIVERY NOTIFICATION' in myText:
            # Denied claim: take the 18 characters after the tracking label.
            start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
            myNumber = myText[start : start+18]
            print("Denied: " + myNumber)
            arrayDenied.append(myNumber)
        elif 'Dear Customer:' in myText:
            print("This claim was Approved")

            # Approved claim: tracking number (18 characters) ...
            startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
            myNumber = myText[startTrackingNum : startTrackingNum+18]

            # ... plus the 11-character claim number.
            startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
            myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]

            arrayApproved.append(myNumber + " - " + myClaimNumber)
        else:
            print("I don't know if this is approved or denied")

iterate()

# Write each list to its own CSV file, one value per row.
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied)
print(arrayApproved)

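Update: Many helpful answers. This is the route I took, and it works really well, if I do say so myself. This will save a ton of time! The complete code, shown above, is here for future viewers.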

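Update: Added the rest of my finished code, which writes the lists to CSV files. From there I apply a few =LEFT() formulas and, boom, I have 1000 tracking numbers in minutes. This is why programming is great.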
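
The =LEFT() step is presumably needed because each approved row is stored as a single "tracking - claim" string. A minimal sketch, assuming that is the case, of writing the two values as separate CSV columns instead. The function name write_pairs, the column headers, and the sample values are hypothetical, and an in-memory buffer stands in for the real output file:

```python
import csv
import io

def write_pairs(output, pairs):
    """Write (tracking_number, claim_number) pairs as two CSV columns."""
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["tracking_number", "claim_number"])  # hypothetical headers
    writer.writerows(pairs)

# Hypothetical values, for illustration only.
approved = [("1Z000000YW00000000", "00000000000")]

buffer = io.StringIO()  # stands in for open('Approved.csv', 'w', newline='')
write_pairs(buffer, approved)
print(buffer.getvalue())
```

With the values in separate columns, the spreadsheet can sort or filter on either one directly.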

Answer 1

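If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, move to where it ends, and slice from that point to the end of the following 18 characters.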

# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Find the desired string and get the subsequent 18 characters:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        arrayDenied.append(myNumber)

You could also change the append line to arrayDenied.append(myText + ' ' + myNumber) or something similar.
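
One caveat with this approach: str.index raises ValueError when the label is missing from a file. A minimal sketch of a helper that returns None in that case, using the label and the 18-character length from the question. The name extract_after is hypothetical:

```python
MARKER = "Tracking Identification Number..."

def extract_after(text, marker=MARKER, length=18):
    """Return the `length` characters that follow `marker`, or None if absent."""
    pos = text.find(marker)  # str.find returns -1 instead of raising
    if pos == -1:
        return None
    start = pos + len(marker)
    return text[start:start + length]

sample = (
    "Shipper Invoice Number..............30057010"
    "Tracking Identification Number...1Z000000YW00000000"
)
print(extract_after(sample))
print(extract_after("no marker here"))
```

The caller can then skip a file whose result is None rather than letting one malformed file stop the whole run.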

Answer 2

Regular expressions are the way to go for this task. Here is one way to modify your code to search for a pattern.

import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

def check(filename):
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print("This claim was Denied")
        print(isDenied)
        # Search the text that was just read from the file
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            # The match still includes the leading dots, so strip them
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print(isDenied)
    else:
        print("I don't know if this is approved or denied")

Explanation:

r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

(?<=Tracking Identification Number) is a lookbehind: the match must be immediately preceded by the string "Tracking Identification Number", which is not itself part of the match
(?:(\.+)) matches one or more dots (.) (we strip these off afterwards)
[A-Za-z0-9]{18} matches 18 letters (upper or lower case) or digits

More about Regex.
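
As an alternative to stripping the dots afterwards, a capture group can return just the 18 characters. This is a sketch of a variant pattern, not the pattern used above. The names TRACKING_RE and find_tracking_numbers are hypothetical, and the sample text is taken from the question:

```python
import re

# Variant pattern: \.+ consumes the dot leaders, and the parentheses
# capture exactly 18 letters or digits.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")

def find_tracking_numbers(text):
    """Return every captured 18-character tracking number found in `text`."""
    return TRACKING_RE.findall(text)

sample = (
    "Shipper Invoice Number..............30057010"
    "Tracking Identification Number...1Z000000YW00000000\n"
    "Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B\n"
)
print(find_tracking_numbers(sample))
```

Because the pattern has a single capture group, re.findall returns a list of the captured strings only, and an empty list when nothing matches.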

Answer 3

I think this solves your problem; just turn it into a function.

import re

string = 'Tracking Identification Number...1Z000000YW00000000'

no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string

# Captures everything after "Tracking Identification Number"
matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)

# re.search returns None when nothing matches
if matchObj:
    print(matchObj.group(1))
else:
    print("No match!")

If you want to read the documentation, see: https://docs.python.org/3/library/re.html#re.search
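
A minimal sketch of the function form this answer suggests. The name tracking_after_label is hypothetical. The ^ anchor is dropped here because, in the question's sample file, the label appears in the middle of a line rather than at the start:

```python
import re

def tracking_after_label(line, label="Tracking Identification Number"):
    """Strip the dot leaders from `line` and return what follows `label`, or None."""
    no_dots = line.replace(".", "")  # same effect as re.sub(r'\.', '', line)
    match = re.search(re.escape(label) + r"(.*)", no_dots)
    return match.group(1) if match else None

print(tracking_after_label("Tracking Identification Number...1Z000000YW00000000"))
print(tracking_after_label(
    "Shipper Invoice Number..............30057010"
    "Tracking Identification Number...1Z000000YW00000000"
))
print(tracking_after_label("Number of Parcels........................1"))
```

Note that (.*) captures everything up to the end of the line, so this relies on the tracking number being the last thing on its line, as it is in the sample.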

Finding and extracting a string from multiple text files in Python
