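Using Python

Time: 2018-08-25 19:35:07

Tags: python

I want to extract numeric or alphanumeric values from a table-like structure.

This table-like structure may contain junk or out-of-order data.

For example,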

''' 5. Item | 6.Marks and 7. Numberand kind of packages; 8. Ori 9. Quantity (Gross weight or 10. Invoice
number ` numbers on description of goods including Conferring other measurement), and number(s)
packages HS Code (6 digits) and brand Criterion (see value (FOB) where RVC is and date of cnaommep(ainyf apipslsiucianbglet)h.irNdapmaertoyf Overleaf Notes) appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s)
invoice UF applicable)
 91501937'''

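The goal is to get the number under the invoice field, which here is 91501937.

This is OCR output, and I have the positions.

This is how it looks in the searchable PDF.

The problem is that regular expressions do not work here. I also tried tabula, but tabula treats this structure as junk.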

I tried a regular expression like re.search(r'(invvooice(s)).*(\d+)',first_string,re.DOTALL), but it fails to capture anything.
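One likely reason the pattern above returns nothing is that the unescaped `(s)` is read as a capture group, so the regex looks for the literal text `invvooices`, which never occurs in the OCR output. The sketch below shows this on a shortened stand-in for the sample text and tries an escaped, non-greedy variant. It assumes the label is always garbled exactly as `invvooice(s)`, which will not hold for other scans. Note that escaping alone is not enough: with a greedy `.*`, the final group would capture only the last digit.

```python
import re

# Shortened stand-in for the OCR text in the question (tail of the sample).
first_string = (
    "appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s)\n"
    "invoice UF applicable)\n"
    " 91501937"
)

# Original pattern: "(s)" is a group, so this searches for the text "invvooices".
original = re.search(r'(invvooice(s)).*(\d+)', first_string, re.DOTALL)
print(original)  # None

# Escaped parentheses match "(s)" literally; the lazy ".*?" stops at the
# first run of digits after the label instead of swallowing most of it.
fixed = re.search(r'invvooice\(s\).*?(\d+)', first_string, re.DOTALL)
invoice_number = fixed.group(1) if fixed else None
print(invoice_number)  # 91501937
```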

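1 Answer:

Answer 0: (score: 1)

  

It took me a while, but I finally figured it out. I wrote this code assuming the invoice number is always at the end, but it would not be hard to edit so that it also works when the number is somewhere else.

Here is my solution: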

x =  "5. Item | 6.Marks and 7. Numberand kind of packages; 8. Ori 9. Quantity (Gross weight or 10. Invoice number ` numbers on description of goods including Conferring other measurement), and number(s) packages HS Code (6 digits) and brand Criterion (see value (FOB) where RVC is and date of cnaommep(ainyf apipslsiucianbglet)h.irNdapmaertoyf Overleaf Notes) appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s) invoice UF applicable)  91501937"

a = x.lower()
# Split on whitespace; the invoice number is assumed to be the last token.
wordlist = a.split()

# Optional: print every token with its index (only used to trace progress).
for number, word in enumerate(wordlist):
    print('word number %d: %s' % (number, word))

print('here is your number: %s' % (wordlist[-1]))
  

Edit: you don't need the loop that prints each word; it was only there to track my progress.
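If the invoice number is not guaranteed to be the very last token, one possible adjustment is to scan the tokens from the end and take the first one that looks like an invoice number. This is only a sketch: the helper name `last_numeric_token` and the `min_digits=6` threshold are assumptions made for illustration, not something stated in the question.

```python
def last_numeric_token(text, min_digits=6):
    """Return the last whitespace-separated token made only of digits.

    min_digits is an assumed threshold that skips short numbers such as
    the column labels ("10.") or the "(6 digits)" remark in the header.
    """
    for token in reversed(text.split()):
        if token.isdigit() and len(token) >= min_digits:
            return token
    return None


# Shortened sample with extra junk after the number.
sample = (
    "10. Invoice number HS Code (6 digits) minvvooice(s) "
    "invoice UF applicable)  91501937 trailing junk"
)
print(last_numeric_token(sample))  # 91501937
```

For alphanumeric invoice numbers, `token.isdigit()` could be swapped for `token.isalnum()` plus a check that the token contains at least one digit, though that will also let more OCR noise through.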