Python regex with named groups for multi-line matching

Date: 2017-06-10 02:58:34

Tags: python regex

I have text like this:

Alabama[STATE]
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Troy (Troy University)[15]
Tuskegee (Tuskegee University)[18]
Alaska[STATE]
Fairbanks (University of Alaska Fairbanks)[15]
Arizona[STATE]
Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)

I am trying to use a Python regex to read the states and their university lists into two named groups. My code is:

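import re

# UFILE is the path to the input text file (defined elsewhere)
UNIV_LIST = r"(?P<state>(\w)+)\[.*\n(?P<region>(.*?).*)"
RE_COMMIT = re.compile(UNIV_LIST)
text = open(UFILE).read()
each_group = RE_COMMIT.finditer(text)
for rc in each_group:
    # use the group names; rc.groups()[1] is the inner (\w) group, not the region
    state = rc.group('state')
    regions = rc.group('region')
    print('State is %s' % state)
    print('regions are %s' % regions)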

Expected output:

State is : Alabama
Regions are : Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Troy (Troy University)[15]
Tuskegee (Tuskegee University)[18]
State is : Alaska
Regions are : Fairbanks (University of Alaska Fairbanks)[15]
State is : Arizona
Regions are : Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)

But the current output is:

UNIV_LIST = r"(?P<state>(\w+))\[edit\]\n(?P<region>(.*))\n+"

State is Alabama
regions are Auburn (Auburn University)[1]
State is Alaska
regions are Fairbanks (University of Alaska Fairbanks)[2]
State is Arizona
regions are Flagstaff (Northern Arizona University)[6]

Any suggestions on how to capture the regions correctly in the named group?

[Edit] The actual text is:

Alabama[STATE]
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College,     Faulkner University)
Troy (Troy University)[15]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Tuskegee (Tuskegee University)[18]
Alaska[STATE]
Fairbanks (University of Alaska Fairbanks)[15]
Arizona[STATE]
Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas
Arkadelphia (Henderson State University, Ouachita Baptist University)[15]
Conway (Central Baptist College, Hendrix College, University of Central  Arkansas)[15]
Fayetteville (University of Arkansas)[20]
Jonesboro (Arkansas State University)[21]
Magnolia (Southern Arkansas University)[15]
Monticello (University of Arkansas at Monticello)[15]
Russellville (Arkansas Tech University)[15]
Searcy (Harding University)[18]
California[STATE]

The following regex:

UNIV_LIST = r"(?P<state>^(\w+\[STATE\]))\r?\n?(?P<region>((^[^[]+)(\[\d+\])?(?!\[STATE\])$\r?\n?)+)"

gives most of the expected result, but some regions are missing:

State is : Alabama
Regions are : Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College,     Faulkner University)
Troy (Troy University)[15]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Tuskegee (Tuskegee University)[18]
State is : Alaska
Regions are : Fairbanks (University of Alaska Fairbanks)[15]
State is : Arizona
Regions are : Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)

In the result I actually get, the lines

Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College,     Faulkner University)
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Tuskegee (Tuskegee University)[18]

are missing. Any suggestions on what is wrong?

[Edit]

UNIV_LIST = r"(?P<state>^(\w+\s*\w*\[edit\]))\r?\n?(?P<region>((^[^[]+)(\[\d+\]){0,}?(?!\[edit\])$\r?\n?)+)"

This handles states whose names have two words, such as New Mexico. But one case still fails:

Pomona (Cal Poly Pomona, WesternU)[9][10][11] and formerly Pomona College

1 answer:

Answer 0: (score: 1)

The following regex works.

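import re

UNIV_LIST = r"(?P<state>^(\w+\[STATE\]))\r?\n?(?P<region>((^[^[]+)(\[\d+\]){0,}?(?!\[STATE\])$\r?\n?)+)"
# re.MULTILINE makes ^ and $ match at every line boundary, not only at the ends of the text
RE_COMMIT = re.compile(UNIV_LIST, re.IGNORECASE | re.MULTILINE)
# text holds the contents of the input file, read as in the question
each_group = RE_COMMIT.finditer(text)
for rc in each_group:
    print('State is : %s' % rc.group('state'))
    print('Region are : %s' % rc.group('region'))
    print('-' * 40)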

Output:

State is : Alabama[STATE]
Region are : Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Troy (Troy University)[15]
Tuskegee (Tuskegee University)[18]

----------------------------------------
State is : Alaska[STATE]
Region are : Fairbanks (University of Alaska Fairbanks)[15]

----------------------------------------
State is : Arizona[STATE]
Region are : Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
----------------------------------------
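
A follow-up on the line from the question that still fails (`Pomona (Cal Poly Pomona, WesternU)[9][10][11] and formerly Pomona College`): as far as I can tell, the per-line part of the pattern above expects the line to end right after the bracketed references, so trailing text breaks it. Below is a minimal sketch of one alternative, not the method used above: treat every line that does not end in `[STATE]` as a region line, whatever it contains, and keep the `[STATE]` marker out of the `state` group. The names `STATE_BLOCK`, `parse_states` and `TEXT` are made up for illustration, the sample lines are copied from the question, and the sketch assumes `\n` line endings and that every state header ends in `[STATE]`.

```python
import re

# Sample lines copied from the question, including the line that still failed.
TEXT = """\
Alabama[STATE]
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Alaska[STATE]
Fairbanks (University of Alaska Fairbanks)[15]
California[STATE]
Pomona (Cal Poly Pomona, WesternU)[9][10][11] and formerly Pomona College
"""

# state : everything before "[STATE]" on a header line (spaces allowed)
# region: one or more following lines that do NOT end in "[STATE]";
#         the negative lookahead is checked once at the start of each line
STATE_BLOCK = re.compile(
    r"^(?P<state>[^\[\n]+)\[STATE\]\n"
    r"(?P<region>(?:(?!.*\[STATE\]$).+\n?)+)",
    re.MULTILINE,
)


def parse_states(text):
    """Return a list of (state, [region lines]) tuples."""
    result = []
    for m in STATE_BLOCK.finditer(text):
        regions = m.group("region").splitlines()
        result.append((m.group("state").strip(), regions))
    return result


parsed = parse_states(TEXT)
for state, regions in parsed:
    print("State is : %s" % state)
    print("Regions are : %s" % "\n".join(regions))
    print("-" * 40)
```

Because the pattern requires at least one region line, a state header that is followed directly by another header, a blank line, or the end of the text is skipped. A header without the `[STATE]` marker is not recognised as a state and ends up as a region line of the previous state.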