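Using regular expressions in Python to search course listings converted from a PDF

Asked: 2014-01-03 00:01:22

Tags: python regex

I am writing a regular expression in Python to search for strings in a txt document. The strings I am looking for look like this: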

  1. ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220
  2. ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285 (or ASTD 160).
  3. ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are taking any astronomy course numbered 250 or higher. For students not majoring or minoring in a science.) Prerequisite: MATH 012 or higher.
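  4. ASTD 380 American Relations with China and Japan: 1740 to Present (3) (Fulfills the general education requirement in the social sciences.) A study of American political, economic, and cultural relations with China and Japan from the American colonial era to modern times…

The expressions I want to find are strings that begin with a course code (e.g. ACCT 221) and end with the sentence containing the prerequisites. In some cases there is no prerequisite sentence, as in example 4.

This is what I have so far:

    [A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,100} \(\d\).*?\.(?!\))

This works for examples 1 and 2, but not for example 3. (I added the (?!\)) to handle cases like example 2, not realizing that a parenthetical can contain more than one sentence, which puts a period in the middle of the parenthetical.)

I think what I am looking for is a way to match a string that starts with the expression I have written up to \(\d\) and ends at a period that is NOT inside parentheses, wherever those parentheses are. I tried adding .* to the negative lookahead at the end, but that did not work properly. I tried adding .*? to make it non-greedy, so that it would not return the whole file starting from the first course code, but it made no difference.

I feel like I am missing something really simple here. Thanks in advance for your help.

Let me know if I need to clarify anything.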
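To make the failure on example 3 concrete, here is a minimal reproduction. The pattern is the one above, unchanged; the variable names are only illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,100} \(\d\).*?\.(?!\))'
)

example_3 = (
    "ASTR 100 Introduction to Astronomy (3) (Not open to students who have "
    "taken or are taking any astronomy course numbered 250 or higher. For "
    "students not majoring or minoring in a science.) Prerequisite: MATH 012 "
    "or higher."
)

match = pattern.search(example_3)
# The lazy .*? stops at the first period that is not directly followed
# by ")". Here that is the period after "higher" inside the parenthetical,
# so the match ends before the prerequisite sentence is reached.
print(match.group(0))
```

The lookahead only checks the single character after the period, so it cannot tell whether the period sits inside a parenthetical that closes later.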

3 answers:

Answer 0 (score: 1)

This is only possible if the parentheses are not nested:

[A-Z]{4} \d{3}(?:(?=([^.()]+))\1|\([^)]*\))+\.
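As a rough sketch of how this pattern could be used from Python: the first alternative consumes a run of characters that are neither periods nor parentheses, the second swallows a whole parenthetical (periods included), and the final \. therefore only matches a period outside parentheses. The sample text below assumes every entry ends with a period; because [^.()] also matches newlines, an entry without a closing period would run on into the next one.

```python
import re

pattern = re.compile(r'[A-Z]{4} \d{3}(?:(?=([^.()]+))\1|\([^)]*\))+\.')

# Sample entries taken from the question, each ending with a period.
text = "\n".join([
    "ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.",
    "ASTD 485 Issues in East Asian Studies (3) (Intended as a final "
    "capstone course to be taken in a student's last 15 credits.) "
    "Prerequisites: ASTD 284 (or ASTD 150) and 285 (or ASTD 160).",
    "ASTR 100 Introduction to Astronomy (3) (Not open to students who have "
    "taken or are taking any astronomy course numbered 250 or higher. For "
    "students not majoring or minoring in a science.) Prerequisite: MATH 012 "
    "or higher.",
])

# Use finditer and group(0): because the pattern contains a capture group,
# findall would return only the contents of that group, not the full match.
matches = [m.group(0) for m in pattern.finditer(text)]
for course in matches:
    print(course)
```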

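Answer 1 (score: 1)

You are looking for everything from the four-letter department code to the first period after "Prerequisite", right? So say that explicitly.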

>>IN:
import re

txt = """
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.
ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be
taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285
(or ASTD 160).
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are
taking any astronomy course numbered 250 or higher. For students not majoring or
minoring in a science.) Prerequisite: MATH 012 or higher."""

# re.DOTALL lets "." match newlines, because entries wrap across several lines.
pat = re.compile(r'[A-Z]{4}.*?Prerequisites?.*?\.', re.DOTALL)
courses = pat.findall(txt)
for course in courses:
    print(course + "\n")

>>OUT:
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.

ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be
taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285
(or ASTD 160).

ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are
taking any astronomy course numbered 250 or higher. For students not majoring or
minoring in a science.) Prerequisite: MATH 012 or higher.
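One caveat: for an entry with no prerequisite sentence (example 4 in the question), the lazy .*? keeps going into the next course until it finds "Prerequisite", so two entries end up merged. A possible workaround, sketched below, is to cut the text into entries first and only then look for the prerequisite sentence inside each one. The names ENTRY, PREREQ and split_entries are hypothetical, the ASTD 380 description is shortened for the sample, and the sketch assumes that every entry starts at the beginning of a line and that wrapped continuation lines never start with a course code.

```python
import re

# Assumption: an entry runs from a course code at the start of a line up to
# the next such code (or the end of the text).
ENTRY = re.compile(r'^[A-Z]{4} \d{3}\b.*?(?=^[A-Z]{4} \d{3}\b|\Z)',
                   re.MULTILINE | re.DOTALL)
PREREQ = re.compile(r'Prerequisites?:.*?\.', re.DOTALL)


def split_entries(text):
    """Return a list of (entry_text, has_prerequisite) pairs."""
    results = []
    for m in ENTRY.finditer(text):
        entry = m.group(0).strip()
        p = PREREQ.search(entry)
        if p:
            # Trim the entry at the end of the prerequisite sentence.
            results.append((entry[:p.end()], True))
        else:
            # No prerequisite sentence: keep the entry as it is.
            results.append((entry, False))
    return results


# Shortened sample; the ASTD 380 description is abbreviated for illustration.
sample = """ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220.
ASTD 380 American Relations with China and Japan: 1740 to Present (3) (Fulfills the
general education requirement in the social sciences.) A study of American political,
economic, and cultural relations with China and Japan.
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are
taking any astronomy course numbered 250 or higher. For students not majoring or
minoring in a science.) Prerequisite: MATH 012 or higher."""

entries = split_entries(sample)
for entry, has_prereq in entries:
    print(has_prereq, entry[:40])
```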

Answer 2 (score: 1)

There is nothing wrong with using two simpler regular expressions instead of one complicated one:

import re

text = '''\
ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220
ASTD 485 Issues in East Asian Studies (3) (Intended as a final capstone course to be taken in a student's last 15 credits.) Prerequisites: ASTD 284 (or ASTD 150) and 285 (or ASTD 160).
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are taking any astronomy course numbered 250 or higher. For students not majoring or minoring in a science.) Prerequisite: MATH 012 or higher.
ASTD 380 American Relations with China and Japan: 1740 to Present (3) (Fulfills the general education requirement in the social sciences.) A study of American political, economic, and cultural relations with China and Japan from the American colonial era to modern times'''

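courses = {}
for line in text.splitlines():
    # The course code is the four-letter department plus the three-digit number.
    course = re.match(r'([A-Z]{4}\s+\d{3})', line).group(1)
    # Everything after "Prerequisite(s):" is the prerequisite text, if present.
    m = re.search(r'Prerequisites?:\s*(.*)', line)
    if m:
        pre = m.group(1)
    else:
        pre = 'None'
    courses[course] = pre

print('COURSE\t\tPREREQUISITE')

for course in sorted(courses):
    print('{}\t{}'.format(course, courses[course]))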

Prints:

COURSE      PREREQUISITE
ACCT 221    ACCT 220
ASTD 380    None
ASTD 485    ASTD 284 (or ASTD 150) and 285 (or ASTD 160).
ASTR 100    MATH 012 or higher.
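Text converted from a PDF often contains lines that are not course entries (page headers, blank lines), and on such a line re.match returns None, so the .group(1) call above would raise AttributeError. A minimal sketch of a more defensive variant follows; parse_courses, COURSE and PREREQ are hypothetical names, the "Course Descriptions" header line is invented for the sample, and the ASTD 380 line is shortened. Like the code above, it assumes one complete entry per line.

```python
import re

COURSE = re.compile(r'([A-Z]{4})\s+(\d{3})\b')
PREREQ = re.compile(r'Prerequisites?:\s*(.*)')


def parse_courses(text):
    """Map each course code to its prerequisite text, or None if absent."""
    courses = {}
    for line in text.splitlines():
        head = COURSE.match(line)
        if head is None:
            # Not a course entry (header, blank line, other noise): skip it.
            continue
        code = '{} {}'.format(head.group(1), head.group(2))
        pre = PREREQ.search(line)
        courses[code] = pre.group(1).strip() if pre else None
    return courses


# Illustrative input with a non-course header line and a blank line.
sample = """Course Descriptions

ACCT 221 Principles of Accounting II (3) Prerequisite: ACCT 220
ASTD 380 American Relations with China and Japan: 1740 to Present (3)
ASTR 100 Introduction to Astronomy (3) (Not open to students who have taken or are taking any astronomy course numbered 250 or higher. For students not majoring or minoring in a science.) Prerequisite: MATH 012 or higher."""

result = parse_courses(sample)
for code in sorted(result):
    print('{}\t{}'.format(code, result[code]))
```

Returning None rather than the string 'None' keeps "no prerequisite" distinguishable from real text; the caller can format it for display as needed.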