Question

我需要解析文本文件，其中相关信息通常以非线性方式分布在多行中。一个例子：

1234
 1         IN THE SUPERIOR COURT OF THE STATE OF SOME STATE           
 2              IN AND FOR THE COUNTY OF SOME COUNTY                
 3                      UNLIMITED JURISDICTION                        
 4                            --o0o--                                 
 5                                                                    
 6   JOHN SMITH AND JILL SMITH,         )                             
                                        )                             
 7                  Plaintiffs,         )                             
                                        )                             
 8        vs.                           )     No. 12345
                                        )                             
 9   ACME CO, et al.,                   )                             
                                        )                             
10                  Defendants.         )                             
     ___________________________________)

我需要提取原告和被告的身份。

这些成绩单有各种各样的格式，所以我不能总是指望那些好的括号，或原告和被告的信息被整齐地包装，例如：

 1        SUPREME COURT OF THE STATE OF SOME OTHER STATE
                      COUNTY OF COUNTYVILLE
 2                  First Judicial District
                     Important Litigation
 3  --------------------------------------------------X
    THIS DOCUMENT APPLIES TO:
 4
    JOHN SMITH,
 5                            Plaintiff,          Index No.
                                                  2000-123
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.

12  --------------------------------------------------X

两个常数是：

“原告”将在之后发生原告的姓名，但不是必须在同一条线上。
原告和被告的姓名将是大写的。

有什么想法吗？

Answer 1

我喜欢Martin's answer 这可能是使用Python的更通用的方法：

import re

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded)
with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE)

# join the list separating by whitespace
str_of_matches = ' '.join(list_of_matches)

# split string by digits (line numbers)
tokens = re.split(r'\d',str_of_matches)

# plaintiffs will be in 2nd-to-last group
plaintiff = tokens[-2].strip()

试验：

with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)>>> tokens = re.split(r'\d', str_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH and JILL SMITH'

with open('no_paren.txt','r') as f:
  no_paren = f.read() # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH'

解析二维文本

1 个答案: