Question

I want to extract data from job postings and output it as structured JSON. One job posting looks like this:

In [185]: text = """Company
     ...: 
     ...: Stack Overflow
     ...: 
     ...: Job Title
     ...: 
     ...: Student
     ...: 
     ...: Job Description
     ...: 
     ...: Our client is providing the innovative technologies, ....
     ...: 
     ...: Requirements
     ...: .....
     ...: About the Company
     ...: 
     ...: At ...., we are a specialized ..
     ...: 
     ...: Contact Info
     ...: ...
     ...: """

I tried to extract the fields with named groups:

jobs_regex = re.compile(r"""
(?P<company>Company(?<=Company).*(?:=Job Title))
# the parts between "Company and Job Title
(?P<job_title>Job Title(?<=Job Title).*(?:=Job Description))
# the parts between "Job Title and Job Description
....
""",re.VERBOSE)

However, when I run it, I get an empty list:

In [188]: jobs_regex.findall(text)
Out[188]: []

How do I fix the lookarounds (?:) and (?<=)?

Answer 1

I don't know whether you really want to use lookarounds, but here is a simple solution that does without them:

Company(?P<company>.*)Job Title(?P<job_title>.*)Job Description
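A minimal sketch of how this pattern could be applied to the posting. Two things here are assumptions that are not part of the answer: the re.DOTALL flag, which is needed because the sample text spans several lines and "." does not match a newline by default, and the lazy ".*?" quantifiers, which stop at the first following label. The shortened text and the fields dict are illustrative.

```python
import re

# Shortened stand-in for the posting text from the question.
text = """Company

Stack Overflow

Job Title

Student

Job Description

Our client is providing the innovative technologies, ....
"""

# re.DOTALL lets "." cross line breaks; the lazy ".*?" stops at the first
# following label instead of running on to a later one.
pattern = re.compile(
    r"Company(?P<company>.*?)Job Title(?P<job_title>.*?)Job Description",
    re.DOTALL,
)

match = pattern.search(text)
# Strip the blank lines that surround each captured value.
fields = {name: value.strip() for name, value in match.groupdict().items()}
print(fields)  # {'company': 'Stack Overflow', 'job_title': 'Student'}
```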

Answer 2

With this pattern:

(?P<company>Company(?<=Company).*(?:=Job Title))

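in addition to the positive lookbehind and the intended lookahead, you are needlessly requiring "Company" to be matched explicitly. Note also that (?:=Job Title) is not a lookahead: it is a non-capturing group that matches the literal text "=Job Title". A lookahead is written (?=...).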

So this fixes the problem by keeping only the lookbehind for "Company" and correcting the lookahead:

(?P<company>(?<=Company).*(?=Job Title))
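The difference between the two patterns can be checked with a short sketch. The single-line text is a hypothetical stand-in for the posting; with the multi-line original, re.DOTALL would also be needed so that "." can cross line breaks.

```python
import re

# Hypothetical single-line stand-in for the posting text.
text = "Company Stack Overflow Job Title Student Job Description ..."

# Original attempt: (?:=Job Title) demands a literal "=Job Title" in the input.
broken = re.compile(r"(?P<company>Company(?<=Company).*(?:=Job Title))")

# Corrected: the lookbehind and lookahead assert the labels without consuming them.
fixed = re.compile(r"(?P<company>(?<=Company).*(?=Job Title))")

print(broken.search(text))                          # None
print(fixed.search(text).group("company").strip())  # Stack Overflow
```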

Answer 3

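The main point here is that, with re.VERBOSE, all literal whitespace in your pattern is treated as formatting whitespace and ignored. To match a literal space in this mode, you need to escape it, e.g. Job Description => Job\ Description, or replace it with the \s shorthand character class. As a side note, if you ever need a literal # in the pattern, escape that character too, because in a verbose regex it starts a comment.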
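The whitespace and # rules can be verified with a small sketch. The variable names are illustrative, and the character-class form [ ] is an additional option that the answer does not mention.

```python
import re

# In VERBOSE mode an unescaped space is dropped, so this pattern is really "JobTitle".
unescaped = re.compile(r"Job Title", re.VERBOSE)

# Three ways to keep the space significant.
escaped = re.compile(r"Job\ Title", re.VERBOSE)     # backslash-escaped space
shorthand = re.compile(r"Job\sTitle", re.VERBOSE)   # any whitespace character
in_class = re.compile(r"Job[ ]Title", re.VERBOSE)   # space inside a character class

# An escaped "#" is a literal; the unescaped one starts a comment.
hash_literal = re.compile(r"C\#  # matches the text C#", re.VERBOSE)

print(unescaped.search("Job Title"))             # None
print(unescaped.search("JobTitle").group())      # JobTitle
for pattern in (escaped, shorthand, in_class):
    print(pattern.search("Job Title").group())   # Job Title
print(hash_literal.search("C# and C").group())   # C#
```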

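Another minor issue is that you are trying to match the two substrings consecutively, whereas in the input they do not directly follow one another. One possible solution is to separate the two patterns with the alternation operator |.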

Here is a fixed pattern:

import re

jobs_regex = re.compile(r"""
    (?<=Company).*?(?=Job\ Title)
      # the part between "Company" and "Job Title"
    | # or
    (?P<job_title>Job\ Title).*?(?:Job\ Description)
      # from "Job Title" through "Job Description"
""", re.VERBOSE | re.DOTALL)  # DOTALL so that . also spans the newlines in the input

See the regex demo.

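I kept the named groups and the other groupings that do no harm to the regex, since this looks like part of a longer pattern; make sure these groupings make sense in your final regex.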
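A minimal sketch of how the alternation approach could feed the JSON output the question asks for. It assumes a variant of the pattern in which each alternative is wrapped in its own named group and delimited by lookarounds, plus re.DOTALL because the sample text spans several lines. The shortened text and the record dict are illustrative, not part of the answer.

```python
import json
import re

# Shortened stand-in for the posting text from the question.
text = """Company

Stack Overflow

Job Title

Student

Job Description

Our client is providing the innovative technologies, ....
"""

# Assumed variant: one named group per alternative, labels asserted by lookarounds.
jobs_regex = re.compile(r"""
    (?P<company>(?<=Company).*?(?=Job\ Title))
      # the part between "Company" and "Job Title"
    |
    (?P<job_title>(?<=Job\ Title).*?(?=Job\ Description))
      # the part between "Job Title" and "Job Description"
""", re.VERBOSE | re.DOTALL)

record = {}
for m in jobs_regex.finditer(text):
    # Only one alternative takes part in each match; lastgroup names it.
    record[m.lastgroup] = m.group(m.lastgroup).strip()

print(json.dumps(record, indent=2))
```

Using finditer with lastgroup, rather than findall, avoids the tuples padded with empty strings that findall returns when a pattern has several groups joined by alternation.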

Selecting the parts between a lookbehind and a lookahead
