Python re.findall() returns an empty list

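Time: 2016-11-05 02:26:41

Tags: python regex findall

I am trying to match some words with regular expressions and wrote some Python code to do it. Strangely, re.findall() returns an empty list, even though the same patterns and text file show matches on regxr.com. Here is the code: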

import re

pat1 = '(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = '(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = '(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = '(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = '(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'

def process_file(content):
    res = []
    for line in content:
        matches = re.findall(pat1, line)
        for m in matches:
            m = (m[0], m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat2, line)
        for m in matches:
            m = (m[0], m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat3, line)
        for m in matches:
            m = (m[0], m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat4, line)
        for m in matches:
            m = (m[0], m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat5, line)
        for m in matches:
            m = (m[0], m[1])
            phrase = '%s %s' % m
            res.append(phrase)
    return res

def main(path):
    # Read the tagged file; each line is matched separately
    with open(path) as f:
        contents = f.readlines()
    result = process_file(contents)
    print(result)

This is the text file I am using:

  

sydney_NN_B-NP lumet_NN_I-NP is_VBZ_B-VP the_DT_B-NP director_NN_I-NP whose_WP$_B-NP work_NN_I-NP happen_VBZ_B-VP to_TO_I-VP be_VB_I-VP of_IN_B-PP vary_VBN_B-NP quality_NN_I-NP ._._B-O   he_PRP_B-NP is_VBZ_B-VP praised_VBN_I-VP for_IN_B-PP some_DT_B-NP of_IN_B-PP the_DT_B-NP most_RBS_I-NP important_JJ_I-NP films_NNS_I-NP of_IN_B-PP the_DT_B-NP previous_JJ_I-NP decades_NNS_I-NP ,_,_B-O like_IN_B-PP twelve_CD_B-NP angry_JJ_I-NP men_NNS_I-NP ,_,_B-O serpico_NN_B-NP or_CC_B-O the_DT_B-NP verdict_NN_I-NP ._._B-O   but_CC_B-O ,_,_I-O in_IN_B-PP the_DT_B-NP same_JJ_I-NP time_NN_I-NP ,_,_B-O almost_RB_B-NP any_DT_I-NP of_IN_B-PP such_JJ_B-NP pearls_NNS_I-NP is_VBZ_B-VP follow_VBN_I-VP by_IN_B-PP stinkers_NNS_B-NP that_WDT_B-NP hamper_VBP_B-VP lumet's_JJ_B-NP reputation_NN_I-NP ._._B-O   a_DT_B-NP stranger_NN_I-NP between_IN_B-PP us_PRP_B-NP ,_,_B-O 1992_CD_B-NP rip-off_NN_I-NP of_IN_B-PP peter_NN_B-NP weir's_JJ_I-NP witness_NN_I-NP ,_,_B-O belongs_VBZ_B-VP to_TO_B-PP the_DT_B-NP later_NN_I-NP category_NN_I-NP ._._B-O   the_DT_B-NP heroine_NN_I-NP of_IN_B-PP this_DT_B-NP movie_NN_I-NP is_VBZ_B-VP emily_JJ_B-NP eden_FW_I-NP (_(_B-O melanie_JJ_B-NP griffith_NN_I-NP )_)_B-O ,_,_I-O tough_JJ_B-NP lady_NN_I-NP cop_NN_I-NP who_WP_B-NP sometimes_RB_B-ADVP shows_VBZ_B-VP too_RB_B-NP much_JJ_I-NP enthusiasm_NN_I-NP in_IN_B-PP battling_VBG_B-VP bad_JJ_B-NP guys_NNS_I-NP on_IN_B-PP the_DT_B-NP streets_NNS_I-NP of_IN_B-PP new_JJ_B-NP york_NN_I-NP ._._B-O   during_IN_B-PP one_CD_B-NP of_IN_B-PP such_JJ_B-NP actions_NNS_I-NP ,_,_B-O her_PRP$_B-NP partner_NN_I-NP nick_NN_I-NP (_(_B-O jamey_JJ_B-NP sheridan_NNS_I-NP )_)_B-O got_VBD_B-VP hurt_VBN_I-VP and_CC_B-O as_IN_B-PP a_DT_B-NP result_NN_I-NP ,_,_B-O she_PRP_B-NP become_VBZ_B-VP depression_JJ_B-ADJP ._._B-O   in_IN_B-PP order_NN_B-NP to_TO_B-VP help_VB_I-VP her_PRP_B-NP recover_VB_B-VP ,_,_B-O bosses_NNS_B-NP give_VBP_B-VP her_PRP_B-NP rather_RB_I-NP easy_JJ_I-NP task_NN_I-NP of_IN_B-PP locating_VBG_B-VP missing_VBG_B-NP jeweller_NNS_I-NP
who_WP_B-NP belonged_VBD_B-VP to_TO_B-PP hassidic_JJ_B-NP jew_NN_I-NP community_NN_I-NP ._._B-O   emily_NN_B-NP starts_VBZ_B-VP investigation_NN_B-NP and_CC_B-O soon_RB_B-VP realises_VBZ_I-VP that_IN_B-SBAR the_DT_B-NP case_NN_I-NP involves_VBZ_B-VP murder_NN_B-NP ._._B-O   concluding_VBG_B-VP that_IN_B-SBAR the_DT_B-NP perpetrator_NN_I-NP belongs_VBZ_B-VP to_TO_B-PP community_NN_B-NP ,_,_B-O she_PRP_B-NP decides_VBZ_B-VP to_TO_I-VP go_VB_I-VP undercover_JJ_B-ADJP ._._B-O   that_DT_B-NP isn't_RB_B-O easy_JJ_B-ADJP ,_,_B-O because_IN_B-SBAR her_PRP$_B-NP modern_JJ_I-NP manners_NNS_I-NP are_VBP_B-VP colliding_VBG_I-VP with_IN_B-PP traditionalist_NN_B-NP ways_NNS_I-NP ._._B-O   things_NNS_B-NP get_VBP_B-VP even_RB_B-NP more_RBR_B-ADJP complex_JJ_I-ADJP when_WRB_B-ADVP she_PRP_B-NP develop_VBZ_B-VP feelings_NNS_B-NP for_IN_B-PP young_JJ_B-NP cabalistic_JJ_I-NP scholar_NN_I-NP ariel_NN_I-NP (_(_B-O eric_JJ_B-NP thal_NN_I-NP )_)_B-O ._._I-O   using_VBG_B-VP peter_NN_B-NP weir's_JJ_I-NP formula_NN_I-NP isn't_RB_B-O the_DT_B-NP great_JJS_I-NP flaw_NN_I-NP of_IN_B-PP this_DT_B-NP film_NN_I-NP ._._B-O   even_RB_B-NP the_DT_I-NP lame_JJ_I-NP and_CC_I-NP unispiring_JJ_I-NP crime_NN_I-NP mystery_NN_I-NP subplot_NN_I-NP works_VBZ_B-VP to_TO_B-PP the_DT_B-NP certain_JJ_I-NP extent_NN_I-NP ._._B-O   but_CC_B-O the_DT_B-NP worst_JJS_I-NP insult_NN_I-NP to_TO_B-PP viewer's_JJ_B-NP audience_NN_I-NP is_VBZ_B-VP terrible_JJ_B-NP miscasting_NN_I-NP of_IN_B-PP melanie_JJ_B-NP griffith_NN_I-NP ._._B-O   the_DT_B-NP author_NN_I-NP of_IN_B-PP this_DT_B-NP review_NN_I-NP never_RB_B-ADVP likes_VBD_B-VP this_DT_B-NP actress_NN_I-NP very_RB_B-ADVP much_RB_I-ADVP ,_,_B-O but_CC_I-O she_PRP_B-NP was_VBD_B-VP at_IN_B-ADVP least_JJS_I-ADVP tolerable_JJ_B-ADJP in_IN_B-PP some_DT_B-NP of_IN_B-PP her_PRP$_B-NP roles_NNS_I-NP ._._B-O   role_NN_B-NP of_IN_B-PP emily_JJ_B-NP eden_NNS_I-NP ,_,_B-O unfortunately_RB_B-ADVP ,_,_B-O isn't_VBZ_I-O one_CD_B-NP of_IN_B-PP them_PRP_B-NP ._._B-O   first_RB_B-ADVP of_IN_B-PP
all_DT_B-NP ,_,_B-O she_PRP_B-NP can't_MD_B-VP pass_VB_I-VP for_IN_B-PP tough_JJ_B-NP nypd_JJ_I-NP street_NN_I-NP fighter_NN_I-NP ,_,_B-O and_CC_I-O her_PRP$_B-NP attempt_NN_I-NP to_TO_B-VP pass_VB_I-VP for_IN_B-PP orthodox_JJ_B-NP jewish_JJ_I-NP woman_NN_I-NP isn't_RB_B-O much_RB_B-ADJP better_JJR_I-ADJP ._._B-O   screenplay_NN_B-NP by_IN_B-PP robert_JJ_B-NP j_NN_I-NP ._._B-O avrech_NNS_B-NP makes_VBZ_B-VP things_NNS_B-NP even_RB_B-ADJP worst_JJR_I-ADJP with_IN_B-PP some_DT_B-NP formulaic_JJ_I-NP red_JJ_I-NP herring_NN_I-NP subplots_NNS_I-NP (_(_B-O scene_NN_B-NP involving_VBG_B-VP two_CD_B-NP italian_JJ_I-NP gangsters_NNS_I-NP was_VBD_B-VP almost_RB_B-ADJP too_RB_I-ADJP painful_JJ_I-ADJP to_TO_B-VP watch_VB_I-VP )_)_B-O ._._I-O   but_CC_B-O ,_,_I-O on_IN_B-PP the_DT_B-NP other_JJ_I-NP hand_NN_I-NP ,_,_B-O other_JJ_B-NP actors_NNS_I-NP are_VBP_B-VP more_RBR_B-ADJP convincing_JJ_I-ADJP (_(_B-O lee_NN_B-NP richardson_NN_I-NP as_IN_B-PP an_DT_B-NP old_JJ_I-NP rabbi_NN_I-NP ,_,_B-O thal_JJ_B-ADJP as_IN_B-PP ariel_NN_B-NP and_CC_B-O charming_JJ_B-NP mia_NN_I-NP sara_NN_I-NP as_IN_B-PP his_PRP$_B-NP intention_VBN_I-NP bride_NN_I-NP )_)_B-O ,_,_I-O and_CC_I-O the_DT_B-NP photography_NN_I-NP by_IN_B-PP andrzej_JJ_B-NP bartkowiak_NN_I-NP very_RB_B-ADVP effective_RB_I-ADVP creates_VBZ_B-VP atmosphere_NN_B-NP of_IN_B-PP warmth_NN_B-NP when_WRB_B-ADVP the_DT_B-NP scenes_NNS_I-NP take_VBP_B-VP place_NN_B-NP in_IN_B-PP hassidic_JJ_B-NP community_NN_I-NP ._._B-O   also_RB_B-ADVP ,_,_B-O the_DT_B-NP film_NN_I-NP might_MD_B-VP educate_VB_I-VP viewers_NNS_B-NP about_IN_B-PP hassidic_JJ_B-NP culture_NN_I-NP ._._B-O   that_DT_B-NP is_VBZ_B-VP the_DT_B-NP only_JJ_I-NP thing_NN_I-NP that_WDT_B-NP prevent_VBZ_B-VP it_PRP_B-NP from_IN_B-PP turning_VBG_B-VP into_IN_B-PP total_JJ_B-NP waste_NN_I-NP of_IN_B-PP time_NN_B-NP ._._B-O

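1 answer:

Answer 0 (score: 3)

You've been bitten by backslashes! The backslash is the escape character in Python string literals (as in many other languages). For example, \n means "newline", \r means "carriage return", and \b means "backspace", also known as \x08.

All of your expressions contain \b.

So when you write: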

>>> pat1 = '...\b...'

you get:

>>> pat1
'...\x08...'
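
To see the effect in isolation, here is a minimal sketch. The pattern 'cat\b' and the test strings are made up for illustration and are not from the question. It shows that the regex engine receives a literal backspace character, which ordinary text never contains:

```python
import re

plain = 'cat\b'  # normal literal: Python turns \b into one backspace character

# Four characters in total; the last one is backspace (\x08), not a backslash and a "b".
print(len(plain), repr(plain[-1]))

# The regex now looks for "cat" followed by a real backspace character,
# so ordinary text does not match.
print(re.findall(plain, 'cat sat'))

# A string that actually contains a backspace does match.
print(re.findall(plain, 'cat\x08'))
```

Sequences such as \S and \s are not recognized string escapes, so Python leaves them untouched, which is why only the \b parts of the patterns were affected.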

There are two ways to fix this. You can escape each backslash with another backslash, like this:

>>> pat1 = '...\\b...'
>>> pat1
'...\\b...'

Note that you see \\ there because that is Python's representation (repr) of the string; if we print pat1, we get:

>>> print(pat1)
...\b...

The easier fix is to mark the regular expression strings as "raw strings":

  

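The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for backslash escape sequences.

In other words: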
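
As a quick check that the two fixes are equivalent, the sketch below compares a hand-escaped literal with a raw literal. The shortened pattern and the sample string are illustrative assumptions, not the question's data:

```python
import re

escaped = '\\b\\S+_JJ\\b'  # every backslash doubled by hand
raw = r'\b\S+_JJ\b'        # raw literal: backslashes kept exactly as written

# Both literals produce the same string, so the regex engine sees the same pattern.
print(escaped == raw)

# In a raw literal \b stays two characters; in a normal literal it is one (backspace).
print(len(r'\b'), len('\b'))

sample = 'angry_JJ men_NNS'  # tiny made-up input
print(re.findall(raw, sample))
```

Raw literals are usually the less error-prone choice for regular expressions, because the pattern in the source code reads the same as the pattern the regex engine receives.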

pat1 = r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = r'(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = r'(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = r'(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'

With this change, I get matches using your sample data:

>>> re.findall(pat1, data)
[('important', 'films'), ('previous', 'decades'), ('angry', 'men'), ('same', 'time'), ('such', 'pearls'), ("lumet's", 'reputation'), ("weir's", 'witness'), ('melanie', 'griffith'), ('tough', 'lady'), ('much', 'enthusiasm'), ('bad', 'guys'), ('new', 'york'), ('such', 'actions'), ('jamey', 'sheridan'), ('easy', 'task'), ('hassidic', 'jew'), ('modern', 'manners'), ('cabalistic', 'scholar'), ('eric', 'thal'), ("weir's", 'formula'), ('unispiring', 'crime'), ('certain', 'extent'), ("viewer's", 'audience'), ('terrible', 'miscasting'), ('melanie', 'griffith'), ('emily', 'eden')]
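
Once the patterns are raw strings, the five repeated loops in process_file could also be folded into one. The following is only a sketch: the names PATTERNS and extract_phrases and the one-line sample are hypothetical, and only pat1 and pat5 are included to keep it short:

```python
import re

# Two of the raw-string patterns from above (pat1 and pat5), compiled once.
PATTERNS = [
    re.compile(r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'),
    re.compile(r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'),
]

def extract_phrases(lines, patterns=PATTERNS):
    """Return 'word1 word2' for every match of any pattern, line by line."""
    phrases = []
    for line in lines:
        for pattern in patterns:
            # Each pattern has exactly two capturing groups,
            # so findall yields 2-tuples.
            for first, second in pattern.findall(line):
                phrases.append('%s %s' % (first, second))
    return phrases

# Made-up line in the same word_POS_chunk format as the question's file.
sample = ['she_PRP_B-NP never_RB_B-ADVP liked_VBD_B-VP the_DT_B-NP angry_JJ_I-NP men_NNS_I-NP']
print(extract_phrases(sample))
```

Results are grouped per line and, within a line, per pattern in list order, which mirrors the order the original function produced.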