Python regex group parsing of input with multiple groups that may or may not be present

Date: 2017-08-13 13:59:24

Tags: python regex web-scraping data-cleaning

I have an input file in a loose format (by "loose" I mean that not every line contains every field, as explained below):

23 1990-10-10 Clark Kent

I want to define groups for age, date, and name. How do I extract these into a named groupdict() such as

{ age: 23, date: '1990-10-10', name: 'Clark Kent' }

If fields age or date are missing, such as:

1990-10-10 Clark Kent

or

23 Clark Kent

The line should still parse, returning None for any field that could not be found.

{ age: 23, date: None, name: 'Clark Kent' }

Now:

re.match(r'(?P<age>[0-9]+)?\s*(?P<birthday>\d\d\d\d\-\d\d\-\d\d)?\s*(?P<name>(\w|\s)+)',
 "23 1990-10-10 Clark Kent")

Returns the desired output.

When however the testing string is:

"1990-10-10 Clark Kent"

Then the age group greedily grabs the initial 199 and the birthday is not parsed correctly.
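A minimal reproduction of that failure, using only the standard re module and the pattern from the question. The variable names are illustrative. Because both separators are \s* (zero or more), the age group may stop wherever the rest of the pattern can still match, so it backtracks from 1990 to 199 and the name group settles for the single character 0:

```python
import re

# Pattern copied from the question: each optional group is followed by \s*.
pattern = re.compile(
    r'(?P<age>[0-9]+)?\s*'
    r'(?P<birthday>\d\d\d\d\-\d\d\-\d\d)?\s*'
    r'(?P<name>(\w|\s)+)'
)

# All three fields present: parses as intended.
full = pattern.match("23 1990-10-10 Clark Kent").groupdict()

# Age missing: the age group steals the first digits of the date.
no_age = pattern.match("1990-10-10 Clark Kent").groupdict()

print(full)
print(no_age)
```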

How would you go about parsing this file to permissively grab whatever fields can be grabbed?

2 answers:

Answer 0 (score: 3)

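Make each group optional together with its space separator by wrapping the group and the separator in a non-capturing group, as in

(?:(?P<age>[0-9]+) +)?(?:(?P<birthday>\d\d\d\d\-\d\d\-\d\d) +)?(?P<name>[\w ]+)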

See https://regex101.com/r/a41VTh/1
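As a rough sketch of how that pattern might be used from Python: the helper name parse_line and the sample lines are assumptions made for illustration, and only the regular expression comes from the answer. Because each optional group now requires at least one trailing space, the age group can no longer stop in the middle of a date.

```python
import re

# The answer's pattern, split across lines for readability.
PATTERN = re.compile(
    r'(?:(?P<age>[0-9]+) +)?'
    r'(?:(?P<birthday>\d\d\d\d\-\d\d\-\d\d) +)?'
    r'(?P<name>[\w ]+)'
)

def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None

lines = [
    "23 1990-10-10 Clark Kent",
    "1990-10-10 Clark Kent",
    "23 Clark Kent",
    "Clark Kent",
]
records = [parse_line(line) for line in lines]
for record in records:
    print(record)
```

Every value comes back as a string or None, so converting age to an int is left to the caller.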

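Since \s also matches newlines, I used a literal space as the separator to avoid matches that run across lines. If needed, you may want to allow tabs as well: [ \t]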
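The difference can be checked with a short sketch. The helper build_pattern is hypothetical and simply substitutes a separator character class into the same pattern shape, with \d{4}-\d{2}-\d{2} used as a shorthand for the date part. With \s as the separator, the name group runs past the newline into the next line; with [ \t] it stops at the end of the line and still accepts tab-separated input.

```python
import re

def build_pattern(sep):
    """Build the pattern with `sep` as the body of the separator class.

    `sep` is written the way it would appear inside [...],
    for example r'\s' or r' \t'.
    """
    return re.compile(
        r'(?:(?P<age>[0-9]+)[' + sep + r']+)?'
        r'(?:(?P<birthday>\d{4}-\d{2}-\d{2})[' + sep + r']+)?'
        r'(?P<name>[\w' + sep + r']+)'
    )

text = "23 Clark Kent\n1990-10-10 Lois Lane"

# \s matches the newline, so the name swallows part of the second line.
loose_name = build_pattern(r'\s').match(text).group('name')

# Space and tab only: the name ends where the first line ends.
strict_name = build_pattern(r' \t').match(text).group('name')

# Tab-separated input is still accepted by the [ \t] variant.
tabbed = build_pattern(r' \t').match("23\tClark Kent").groupdict()

print(repr(loose_name))
print(repr(strict_name))
print(tabbed)
```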

Answer 1 (score: 1)

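You didn't ask for this. However, I think it's worth mentioning that pyparsing is often an easier alternative to regular expressions and is worth considering.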

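I start by defining a grammar for your input.

  • A date is a run of digits separated by hyphens.
  • An age is at most two digits.
  • A name is alphabetic characters with interspersed spaces. (This should be improved to allow hyphens and apostrophes.)

I build the complete parser, whole, by stating an optional age, (implicitly) followed by an optional date, followed by a required name.

I think you'll agree that this is fairly simple compared with the regex.

Suffixes such as ('age') arrange for the parsed items to be saved under that name, so that they can be retrieved from the result later in the code.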

>>> import pyparsing as p
>>> date = p.Word(p.nums+'-')
>>> age = p.Word(p.nums, max=2)
>>> name = p.Word(p.alphas+' ')
>>> whole = p.Optional(age)('age') + p.Optional(date)('date') + name('name')

Now I can exercise this grammar on your strings. As mentioned above, result behaves like a dict. For each string I list whichever items were parsed from it.

>>> result = whole.parseString('23 1990-10-10 Clark Kent')
>>> [result[_] for _ in ['age', 'date', 'name'] if _ in result]
['23', '1990-10-10', 'Clark Kent']
>>> result = whole.parseString('1990-10-10 Clark Kent')
>>> [result[_] for _ in ['age', 'date', 'name'] if _ in result]
['1990-10-10', 'Clark Kent']
>>> result = whole.parseString('23 Clark Kent')
>>> [result[_] for _ in ['age', 'date', 'name'] if _ in result]
['23', 'Clark Kent']
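The question asked for None where a field is absent, whereas the session above simply omits missing items. A small helper could fill the gaps. The name to_record is hypothetical and not part of pyparsing; it relies only on the `in` test and item lookup that the session already uses. A plain dict stands in for the parse result here so that the sketch runs without pyparsing installed.

```python
FIELDS = ('age', 'date', 'name')

def to_record(result, fields=FIELDS):
    """Map each expected field to its parsed value, or None if it is missing.

    `result` may be any mapping-like object that supports `key in result`
    and `result[key]`.
    """
    return {field: (result[field] if field in result else None)
            for field in fields}

# Stand-in for the result of parsing '23 Clark Kent' (no date present).
parsed = {'age': '23', 'name': 'Clark Kent'}
record = to_record(parsed)
print(record)
```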