Question

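I am working on a project where I need to extract invoice numbers from email bodies. The invoice number can appear anywhere in the mail body, which I am trying to search with Python code. The problem is that the email senders do not use a standard keyword; they refer to the invoice number in many different ways, e.g. invoice number, invoice#, invoice no., inv-no and so on.

Because there is no single keyword, this inconsistency makes it hard for me to extract the invoice number from the mail body.

After reading hundreds of emails, I was able to identify the words most commonly used before an invoice number and built a list of them (about 15 keywords). However, I cannot work out how to search a string for that list of keywords and retrieve the word next to each match in order to identify the invoice number. Invoice numbers can also be either numeric or alphanumeric, which adds more complexity.

I have tried to make some progress, as described below, but I am not getting the desired output.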

inv_list = ['invoice number','inv no','invoice#','invoice','invoices','inv number','invoice-number','inv-number','inv#','invoice no.'] # list of keywords used before invoice number

# Adjacent string literals inside parentheses are joined into one single-line string
example_string = ('Hi Team, Could you please confirm the status of payment '
                  'for invoice# 12345678 and AP-8765432? '
                  'Also, please confirm the status of existing invoice no. 7652908. '
                  'Thanks')

# Basic code to test if any word from inv_list exists in example_string

for item in inv_list:
    if item in example_string:
        print(item)

# gives output like the following (the bare keyword 'invoice' also matches as a substring)

invoice#
invoice
invoice no.

Next, after a few hours of searching, I found the function from "how to get a list with words that are next to a specific word in a string in python", but I could not get it to work with a list of words. I tried:

def get_next_words(mailbody, invoice_text_list, sep=' '):
    mail_body_words = mailbody.split(sep)
    for word in invoice_text_list:
        if word in mail_body_words:
            yield next(mail_body_words)

words = get_next_words(example_string,inv_list)

for w in words:
    print(w)

and got

TypeError: 'list' object is not an iterator
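
For context, this error comes from calling next() on the list returned by str.split: a list is iterable, but it is not itself an iterator. A minimal sketch of the difference, using a made-up three-word string rather than the real mail body:

```python
words = 'invoice# 12345678 and'.split(' ')   # str.split returns a plain list

try:
    next(words)                  # lists do not implement __next__
except TypeError as exc:
    error_message = str(exc)     # the message seen in the traceback above

it = iter(words)                 # wrap the list in an iterator
first = next(it)                 # first word of the list
second = next(it)                # the word that follows it
print(error_message, first, second)
```

Even with iter(), next() only advances from the current position of the iterator, so it does not by itself return "the word after a given keyword"; an index-based lookup is closer to what the question needs.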

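The expected output is the word in example_string that follows any keyword matched from inv_list (I assume I can identify the invoice number from the returned matches).

For the given example, the output should be: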

Match1: 'invoice#'             
Expected Output: '12345678'

Match2: 'invoice no.'          
Expected Output:  '7652908'

Please let me know if more details are needed. Any help would be appreciated!

Answer 1

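You can use an approach similar to the one you have now, but iterate over the other list instead, i.e. over the words of the mail body. Also, to benefit from the lookup time of a hash-based container rather than a list, turn the keyword list into the keys of a dictionary (or a set, as in the code below). It takes more space, but membership tests are faster.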

inv_list = {'invoice number','inv no','invoice#','invoice','invoices','inv number','invoice-number','inv-number','inv#','invoice no.'}

def get_next_words(mailbody, invoice_text_list, sep=' '):
    mail_body_words = mailbody.split(sep)
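    n = len(mail_body_words)
    for i in range(n):
        # Try two-word keywords first so 'invoice no.' wins over the bare 'invoice'
        if i + 2 < n and f'{mail_body_words[i]} {mail_body_words[i+1]}' in invoice_text_list:
            yield mail_body_words[i+2]
        # Single-word keyword: the invoice number is the next word
        elif i + 1 < n and mail_body_words[i] in invoice_text_list:
            yield mail_body_words[i+1]

words = get_next_words(example_string, inv_list)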

for w in words:
    print(w)
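
The one-word and two-word checks can be generalized to keywords of any length. The following is only a sketch and not part of the original answer: the helper name find_after_keywords and its max_words parameter are hypothetical, and it assumes that keywords and invoice numbers are separated by single spaces.

```python
def find_after_keywords(text, keywords, max_words=2):
    """Yield (keyword, next_word) pairs for each keyword found in text.

    Longer keywords are tried first, so 'invoice no.' is preferred
    over the bare 'invoice'.
    """
    keywords = set(keywords)              # hash-based membership tests
    words = text.split(' ')
    i = 0
    while i < len(words):
        for n in range(max_words, 0, -1):
            candidate = ' '.join(words[i:i + n])
            # Only report a keyword if a word actually follows it
            if candidate in keywords and i + n < len(words):
                yield candidate, words[i + n]
                i += n                    # skip the words used by the keyword
                break
        else:
            i += 1                        # no keyword starts at this word


sample = 'status of invoice# 12345678 and existing invoice no. 7652908 please'
matches = list(find_after_keywords(sample, {'invoice#', 'invoice', 'invoice no.'}))
print(matches)
```

Trying the longest candidate first is the design choice that matters here: with a shortest-first order, 'invoice no.' would be reported as the keyword 'invoice' followed by the word 'no.'.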

Answer 2

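Probably not the most efficient code, but it works... Two cases are needed to tell apart, for example, inv no 06363636 and inv 06363636, because of the space between inv and no...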

arr = example_string.split(' ')
for ix in range(len(arr)):
    try: 
        if arr[ix]+" "+arr[ix+1] in inv_list:
            print(arr[ix+2].strip('.'))
        elif arr[ix] in inv_list:
            print(arr[ix+1].strip('.'))
    except IndexError:
        pass

Answer 3

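I made a few modifications to the answer given by atsteich to make it more useful in my scenario. Basically, I only want to capture numeric values as invoice numbers, and to strip some punctuation that may appear together with the invoice number.

Here is the code: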

arr = example_string.split(' ')
remove_symbols = str.maketrans("","",".,-")

for ix in range(len(arr)):
    try: 
        if arr[ix]+" "+arr[ix+1] in inv_list and arr[ix+2].translate(remove_symbols).isdigit():
            print('Invoice number found:'+arr[ix+2].translate(remove_symbols))
        elif arr[ix] in inv_list and arr[ix+1].translate(remove_symbols).isdigit():
            print('Invoice number found:'+arr[ix+1].translate(remove_symbols))
    except IndexError:
        pass

Thanks everyone for the support!
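
As an alternative to splitting on spaces, the same numeric-only capture can be sketched with a regular expression. This is a different technique from the answers above, not something they propose. It assumes the invoice number is a run of digits that directly follows the keyword, optionally separated by whitespace; the names alternation, pattern and found are illustrative.

```python
import re

inv_list = ['invoice number', 'inv no', 'invoice#', 'invoice', 'invoices',
            'inv number', 'invoice-number', 'inv-number', 'inv#', 'invoice no.']

# Longest keywords first, so 'invoice no.' is tried before the bare 'invoice'
alternation = '|'.join(re.escape(k) for k in sorted(inv_list, key=len, reverse=True))

# A keyword, optional whitespace, then a run of digits (numeric numbers only)
pattern = re.compile(rf'({alternation})\s*(\d+)', re.IGNORECASE)

text = ('Hi Team, Could you please confirm the status of payment '
        'for invoice# 12345678 and AP-8765432? '
        'Also, please confirm the status of existing invoice no. 7652908. Thanks')

found = pattern.findall(text)    # list of (keyword, digits) tuples
print(found)
```

Alphanumeric numbers such as AP-8765432 are deliberately not captured by this pattern; supporting them would need a wider character class than \d+.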

How to search a string for the elements of a list and extract the word next to each match

3 answers: