I am parsing files about movies. Here is a sample of the language file:
"!Next?" (1994) Italian
"#1 Single" (2006) English
"#15SecondScare" (2015) English
"#15SecondScare" (2015) {Because We Don't Want You to Fall Asleep
(#1.3)} English
"#15SecondScare" (2015) {Coming and Going (#1.11)} English
"#Adulthood" (????) English
"#Adulting" (2016/I) English
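How can I grab the name, the year, the season and episode (if it is a TV show), and the language from each line? Some fields are not always present (for example, which episode it is).
Here is what I tried: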
for line in file:
    print(re.findall('"(.*)"', line))  # name
    print(re.findall(r"\D(\d{4})\D", line))  # year
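I am already having trouble with the year, because the pattern also captures episode numbers. Is using several patterns the way to go?
Thanks.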
Answer 0 (score: 1)
You can do it like this:
import re
string = """
"!Next?" (1994) Italian
"#1 Single" (2006) English
"#15SecondScare" (2015) English
"#15SecondScare" (2015) {Because We Don't Want You to Fall Asleep
(#1.3)} English
"#15SecondScare" (2015) {Coming and Going (#1.11)} English
"#Adulthood" (????) English
"#Adulting" (2016/I) English
"""
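rx = re.compile(r'''
    ^
    "(?P<name>[^"]+)"                      # title between double quotes
    [^(]+\((?P<year>[^)]+)\)               # year in parentheses, e.g. 1994, ???? or 2016/I
    (?:[^{\n]+\{(?P<subtitle>[^}]+)\})?    # optional episode info in braces (may span lines)
    \s+(?P<language>[A-Z][a-z]*)           # language at the end of the line
    $
''', re.MULTILINE | re.VERBOSE)
movies = [(m.group('name'), m.group('year'), m.group('subtitle'), m.group('language'))
          for m in rx.finditer(string)]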
print(movies)
# [('!Next?', '1994', None, 'Italian'), ('#1 Single', '2006', None, 'English'), ('#15SecondScare', '2015', None, 'English'), ('#15SecondScare', '2015', "Because We Don't Want You to Fall Asleep \n (#1.3)", 'English'), ('#15SecondScare', '2015', 'Coming and Going (#1.11)', 'English'), ('#Adulthood', '????', None, 'English'), ('#Adulting', '2016/I', None, 'English')]
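See a demo of the matches on regex101.com.
A little explanation:
The compiled pattern rx uses named groups, and rx.finditer() is used to look up the matches it finds.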
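The question also asks for the season and episode, which the pattern above leaves inside the subtitle group. A minimal sketch of a second pass over that text follows. It assumes the episode marker always has the form (#season.episode) at the end of the braces, as in the sample lines; the helper name split_subtitle and the pattern EPISODE_RX are hypothetical and not part of the original answer.

```python
import re

# Assumed marker shape: "(#<season>.<episode>)" at the end of the subtitle text.
# DOTALL lets the title part run across a wrapped line.
EPISODE_RX = re.compile(
    r'^(?P<title>.*?)\s*\(#(?P<season>\d+)\.(?P<episode>\d+)\)\s*$',
    re.DOTALL,
)

def split_subtitle(subtitle):
    """Return (title, season, episode); missing parts are None."""
    if subtitle is None:
        # Not a TV episode line: the optional group did not match.
        return None, None, None
    m = EPISODE_RX.match(subtitle)
    if m is None:
        # No "(#s.e)" marker: keep the whole text as the episode title.
        return " ".join(subtitle.split()), None, None
    # Collapse line breaks and repeated spaces left over from wrapped lines.
    title = " ".join(m.group('title').split()) or None
    return title, int(m.group('season')), int(m.group('episode'))

print(split_subtitle("Coming and Going (#1.11)"))
# ('Coming and Going', 1, 11)
print(split_subtitle("Because We Don't Want You to Fall Asleep\n(#1.3)"))
# ("Because We Don't Want You to Fall Asleep", 1, 3)
print(split_subtitle(None))
# (None, None, None)
```

Running it on m.group('subtitle') for each match keeps the main pattern simple; whether two passes or one larger pattern is preferable is a matter of taste.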