Python: regex for mixed cases

Time: 2017-12-16 19:58:36

Tags: python regex

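I have been wrestling with this for a while and have come up empty-handed.

I am parsing a file into a pandas DataFrame and then dumping it into MySQL, and I have a set of lines with variations like the following: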

523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE .35 11/01 01:00:00

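I am trying to turn each line into 4 groups, regardless of whether a group is empty. So far I have made progress with the following regex, but it fails if there is no \d.\d{2} match before the datestamp: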

(\d{6}).([^\d]*).([^\s][\.]\d+[\*]*).(\d{2}\/\d{2}\s\d{2}\:\d{2}\:\d{2})
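The failure can be reproduced with a quick check against a few of the sample rows. This is only an illustrative sketch; the variable names are made up for the example and are not part of the original question.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r"(\d{6}).([^\d]*).([^\s][\.]\d+[\*]*).(\d{2}\/\d{2}\s\d{2}\:\d{2}\:\d{2})"
)

with_amount = "864982 JA-PAREN 4.25* 11/01 01:00:00"
without_amount = "634312 MA-BAREAUTH 11/01 01:00:00"
leading_dot = "969722 MS-DARE .35 11/01 01:00:00"

# Group 3 is mandatory and needs a non-space character before the dot,
# so only the first row can match.
print(pattern.search(with_amount).groups())
print(pattern.search(without_amount))  # None
print(pattern.search(leading_dot))     # None
```

Because group 2 is written as `[^\d]*`, a name that contains a digit (such as G5-FSB) also prevents a match, independently of the amount.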

The idea is to group a line such as

969722 MS-DARE 1.35 11/01 01:00:00

like this:

969722 MS-DARE 1.35 11/01 01:00:00

This works for lines such as 969722 MS-DARE 1.35 11/01 01:00:00, but it breaks when group 2 contains a space, for example:

969722 MS-DARE PIN .35 11/01 01:00:00

which I would like grouped as

969722 MS-DARE PIN 1.35 11/01 01:00:00

Overall, the end goal is to have all of these variations grouped, for example:

523421 F-INV PROC .95 11/01 01:00:00

634312 MA-BAREAUTH 11/01 01:00:00

523421 MK-PERM YEAR 11/01 01:00:00

123512 G5-FSB 3.00 11/01 01:00:00

864982 JA-PAREN 4.25* 11/01 01:00:00

934821 4.00 11/01 01:00:00

620021 I-MAS DIN 5.25* 11/01 01:00:00

969722 MS-DARE .35 11/01 01:00:00

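How can I account for all of these variations so that I always get 4 groups, where an amount such as 3.00 or .35, if present, goes into group 3 and group 3 is empty otherwise?

Update:

https://regex101.com/r/lL8rIj/1/

I am getting close here, but I need an empty group 3 in every match when there is no amount.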

3 answers:

Answer 0: (score: 1)

You can try this:

import re
s = ['523421 F-INV PROC 11/01 01:00:00', '634312 MA-BAREAUTH 11/01 01:00:00', '523421 MK-PERM YEAR 11/01 01:00:00', '123512 G5-FSB 3.00 11/01 01:00:00', '864982 JA-PAREN 4.25* 11/01 01:00:00', '934821 4.00 11/01 01:00:00', '620021 I-MAS DIN 5.25* 11/01 01:00:00', '969722 MS-DARE .35 11/01 01:00:00']
# Split on a space that is followed by, or preceded by, a digit or a non-word character.
final_s = [re.split(r'\s(?=[\d\W])|(?<=[\d\W])\s', i) for i in s]
print(final_s)
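As a small check of what this split yields, the sketch below runs it on two of the sample rows, one without an amount and one with. The names `splitter`, `no_amount` and `has_amount` are invented for the illustration.

```python
import re

# Same split expression as above, compiled once and written as a raw string.
splitter = re.compile(r'\s(?=[\d\W])|(?<=[\d\W])\s')

no_amount = splitter.split('523421 F-INV PROC 11/01 01:00:00')
has_amount = splitter.split('864982 JA-PAREN 4.25* 11/01 01:00:00')

print(no_amount)   # ['523421', 'F-INV PROC', '11/01', '01:00:00']
print(has_amount)  # ['864982', 'JA-PAREN', '4.25*', '11/01', '01:00:00']
```

The date and the time end up in separate fields, and a row has four or five fields depending on whether an amount is present, so some post-processing would still be needed to get exactly four groups per row.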

Answer 1: (score: 1)

It seems you could use

^                                      # start of line
(?P<group1>\d+)\s                      # capture numbers, match whitespace
(?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)? # capture as long as the formats 
                                       # of group 3 and 4 are not met  
                                       # the group is optional
(?P<group3>\d*\.\d+\*?)?\s+            # format of group 3...
(?P<group4>\d+/\d+.+)                  # ... and 4 respectively
$                                      # end of line
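The group 2 construct, `(?:(?!...).)+`, consumes one character at a time and stops as soon as the lookahead sees the start of an amount or of a date. A minimal sketch that isolates this piece, applied to the text that follows the leading number (the names `name_part` and `rest` are invented for this illustration):

```python
import re

# Only the group 2 sub-pattern from the expression above.
name_part = re.compile(r'(?:(?!\d*\.\d+|\d{2}/).)+')

for rest in ('MS-DARE AUT .35 11/01 01:00:00',
             'F-INV PROC 11/01 01:00:00'):
    m = name_part.match(rest)
    print(repr(m.group(0)))
# 'MS-DARE AUT '
# 'F-INV PROC '

# When an amount comes first, nothing can be consumed and there is no match.
print(name_part.match('4.00 11/01 01:00:00'))  # None
```

On its own the sub-pattern keeps the trailing space. In the full expression the space is given back through backtracking when no amount follows, but it stays in group 2 when an amount is present, so stripping group 2 afterwards is a reasonable precaution.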

See a demo on regex101.com.

In Python with pandas, this would be:

import re
import pandas as pd

string = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE AUT .35 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
"""

rx = re.compile(r'''
    ^
    (?P<group1>\d+)\s
    (?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)?
    (?P<group3>\d*\.\d+\*?)?\s+
    (?P<group4>\d+/\d+.+)
    $''', re.VERBOSE | re.MULTILINE)

records = ((m.group(1), m.group(2).rstrip() if m.group(2) else None, 
            m.group(3), m.group(4)) 
            for m in rx.finditer(string))

df = pd.DataFrame(records)
print(df)

This yields

        0             1      2               3
0  864982      JA-PAREN  4.25*  11/01 01:00:00
1  934821          None   4.00  11/01 01:00:00
2  620021     I-MAS DIN  5.25*  11/01 01:00:00
3  969722   MS-DARE AUT    .35  11/01 01:00:00
4  523421    F-INV PROC   None  11/01 01:00:00
5  634312   MA-BAREAUTH   None  11/01 01:00:00
6  523421  MK-PERM YEAR   None  11/01 01:00:00
7  123512        G5-FSB   3.00  11/01 01:00:00

Answer 2: (score: 1)

I would like to propose the following solution:

import re

data = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE AUT .35 11/01 01:00:00
969722 MS-DARE 99/99 AUT .35 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
523421 MK-PERM 3. YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
"""

rx = re.compile(r"""
  ^
  (\d+)
  (?:
    \s([a-z].*[a-z])
    \s(\d?\.\d+)\*?\s
   |(?:
      \s([a-z].*[a-z])(\s)
     |(\s)(\d*\.\d+)\*?\s
    )
  )
  (\d\d(?:[/\s:]\d\d){4})
  $
""", re.I | re.M | re.X)

for m in rx.finditer(data):
  print(tuple(e for e in m.groups() if e))

Result:

('864982', 'JA-PAREN', '4.25', '11/01 01:00:00')
('934821', ' ', '4.00', '11/01 01:00:00')
('620021', 'I-MAS DIN', '5.25', '11/01 01:00:00')
('969722', 'MS-DARE AUT', '.35', '11/01 01:00:00')
('969722', 'MS-DARE 99/99 AUT', '.35', '11/01 01:00:00')
('523421', 'F-INV PROC', ' ', '11/01 01:00:00')
('634312', 'MA-BAREAUTH', ' ', '11/01 01:00:00')
('523421', 'MK-PERM YEAR', ' ', '11/01 01:00:00')
('523421', 'MK-PERM 3. YEAR', ' ', '11/01 01:00:00')
('123512', 'G5-FSB', '3.00', '11/01 01:00:00')
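The pattern has eight capture groups, so it is the `if e` filter that reduces each match to four items, with a captured single space standing in for the missing field. If a real `None` is preferred for the empty slot, one option is to pick the groups by position. The sketch below assumes that approach; `to_row` is a hypothetical helper, not part of the answer.

```python
import re

# Same pattern as in the answer above.
rx = re.compile(r"""
  ^
  (\d+)
  (?:
    \s([a-z].*[a-z])
    \s(\d?\.\d+)\*?\s
   |(?:
      \s([a-z].*[a-z])(\s)
     |(\s)(\d*\.\d+)\*?\s
    )
  )
  (\d\d(?:[/\s:]\d\d){4})
  $
""", re.I | re.M | re.X)


def to_row(m):
    """Hypothetical helper: map the 8 groups onto 4 fields, None when absent."""
    ident = m.group(1)
    name = m.group(2) or m.group(4)    # name captured with or without an amount
    amount = m.group(3) or m.group(7)  # amount captured with or without a name
    stamp = m.group(8)
    return ident, name, amount, stamp


sample = """\
934821 4.00 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
"""
rows = [to_row(m) for m in rx.finditer(sample)]
for row in rows:
    print(row)
```

As in the result above, the trailing asterisk sits outside the capture group, so the amount comes back as 5.25 rather than 5.25*.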