Python 3: parsing and capturing multiple optional fields in one line

Time: 2017-05-18 16:22:32

Tags: regex python-3.x parsing

I am parsing files about movies. Here is a sample of the language file:

"!Next?" (1994)                     Italian
"#1 Single" (2006)                  English
"#15SecondScare" (2015)                 English
"#15SecondScare" (2015) {Because We Don't Want You to Fall Asleep 
 (#1.3)}    English
"#15SecondScare" (2015) {Coming and Going (#1.11)}  English
"#Adulthood" (????)                 English
"#Adulting" (2016/I)                    English

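How can I capture the name, the year, the season and episode (if it is a TV show), and the language from each line? Some of the fields are not always present (for example, which episode it is).

Here is what I tried:

for line in file:
    print(re.findall('"(.*)"', line))  # name
    print(re.findall(r"\D(\d{4})\D", line))  # year

I am already having trouble with the year, because the pattern also captures the episode number. Is using several patterns the way to go?

Thanks.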

1 answer:

Answer 0: (score: 1)

You can do it like this:

import re

string = """
"!Next?" (1994)                     Italian
"#1 Single" (2006)                  English
"#15SecondScare" (2015)                 English
"#15SecondScare" (2015) {Because We Don't Want You to Fall Asleep 
 (#1.3)}    English
"#15SecondScare" (2015) {Coming and Going (#1.11)}  English
"#Adulthood" (????)                 English
"#Adulting" (2016/I)                    English
"""

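rx = re.compile(r'''
            ^
            "(?P<name>[^"]+)"                       # title between double quotes
            [^(]+\((?P<year>[^)]+)\)                # year in parentheses, e.g. 1994, ????, 2016/I
            (?:[^{\n]+\{(?P<subtitle>[^}]+)\})?     # optional episode info in braces
            \s+(?P<language>[A-Z][a-z]*)            # language at the end of the line
            $
            ''', re.MULTILINE | re.VERBOSE)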

movies = [(m.group('name'), m.group('year'), m.group('subtitle'), m.group('language'))
            for m in rx.finditer(string)]
print(movies)
# [('!Next?', '1994', None, 'Italian'), ('#1 Single', '2006', None, 'English'), ('#15SecondScare', '2015', None, 'English'), ('#15SecondScare', '2015', "Because We Don't Want You to Fall Asleep \n (#1.3)", 'English'), ('#15SecondScare', '2015', 'Coming and Going (#1.11)', 'English'), ('#Adulthood', '????', None, 'English'), ('#Adulting', '2016/I', None, 'English')]

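See a demo of the matches on regex101.com.

A little explanation:

  1. First, we define our regular expression pattern in verbose, multiline mode.
  2. We use the compiled pattern rx to iterate over the matches it finds.
  3. We put the named groups into a result tuple.
  4. We end up with a list of tuples.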
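
The question also asks for the season and episode, which the pattern above leaves inside the subtitle group, and the year group can hold values such as ???? or 2016/I. Below is a minimal sketch of one way to post-process each tuple. It assumes the episode marker always has the form (#season.episode), as in the sample data. The names split_subtitle, parse_year, EPISODE_RX and YEAR_RX are hypothetical helpers and not part of the original answer.

```python
import re

# Hypothetical helper patterns (assumptions, not from the original answer).
# Episode marker such as "(#1.3)" at the end of the subtitle: season 1, episode 3.
# re.DOTALL lets the title span the line break in wrapped records.
EPISODE_RX = re.compile(
    r'^(?P<title>.*?)\s*\(#(?P<season>\d+)\.(?P<episode>\d+)\)\s*$',
    re.DOTALL,
)
# Leading four digits of the year field, e.g. "2016" in "2016/I".
YEAR_RX = re.compile(r'\d{4}')


def split_subtitle(subtitle):
    """Return (episode_title, season, episode); all None if there is no subtitle."""
    if subtitle is None:
        return None, None, None
    m = EPISODE_RX.match(subtitle)
    if m is None:
        # Braces were present but without a "(#season.episode)" marker:
        # keep the whitespace-normalized text as the episode title.
        return " ".join(subtitle.split()), None, None
    # Collapse runs of whitespace (including a wrapped line break) to single spaces.
    title = " ".join(m.group('title').split()) or None
    return title, int(m.group('season')), int(m.group('episode'))


def parse_year(year):
    """Return the year as an int, or None for placeholders such as '????'."""
    m = YEAR_RX.match(year)
    return int(m.group()) if m else None


# Usage with one tuple shaped like the entries of the movies list.
name, year, subtitle, language = (
    '#15SecondScare', '2015', 'Coming and Going (#1.11)', 'English')
print((name, parse_year(year), *split_subtitle(subtitle), language))
# ('#15SecondScare', 2015, 'Coming and Going', 1, 11, 'English')
```

Keeping this as a second step leaves the main pattern short. The same fields could also be captured by extending the main pattern with more optional groups, at the cost of readability.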