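I have been wrestling with this for a while and have come up empty-handed.
I parse a file into a pandas DataFrame and then dump it into MySQL, and I have a set of lines with variations like the following: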
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE .35 11/01 01:00:00
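I am trying to turn each line into 4 groups, regardless of whether a group is empty. So far I have made progress with the following regex, but it fails when there is no \d.\d{2} match before the timestamp: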
(\d{6}).([^\d]*).([^\s][\.]\d+[\*]*).(\d{2}\/\d{2}\s\d{2}\:\d{2}\:\d{2})
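To make the failure concrete, here is a minimal sketch that runs the pattern above against three of the line shapes from the question. The names `pattern`, `samples` and `results` are only illustrative. Because the third group is mandatory and requires a non-whitespace character immediately before the literal dot, only the first sample matches; the other two give no match at all.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'(\d{6}).([^\d]*).([^\s][\.]\d+[\*]*).(\d{2}\/\d{2}\s\d{2}\:\d{2}\:\d{2})'
)

samples = [
    '969722 MS-DARE 1.35 11/01 01:00:00',  # amount with a leading digit
    '969722 MS-DARE .35 11/01 01:00:00',   # amount without a leading digit
    '523421 F-INV PROC 11/01 01:00:00',    # no amount at all
]

# Map each sample line to its captured groups, or None when nothing matches.
results = {}
for line in samples:
    m = pattern.search(line)
    results[line] = m.groups() if m else None
    print(line, '->', results[line])
```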
The idea is to take a line and split it into groups, for example:
969722 MS-DARE 1.35 11/01 01:00:00
969722
MS-DARE
1.35
11/01 01:00:00
This works for lines such as
969722 MS-DARE 1.35 11/01 01:00:00
but it breaks when group 2 contains whitespace, for example:
969722 MS-DARE PIN .35 11/01 01:00:00
I would like that to be grouped as:
969722
MS-DARE PIN
1.35
11/01 01:00:00
Overall, the end goal is to get groups for all of these variations, for example:
523421
F-INV PROC
.95
11/01 01:00:00
634312
MA-BAREAUTH
11/01 01:00:00
523421
MK-PERM YEAR
11/01 01:00:00
123512
G5-FSB
3.00
11/01 01:00:00
864982
JA-PAREN
4.25*
11/01 01:00:00
934821
4.00
11/01 01:00:00
620021
I-MAS DIN
5.25*
11/01 01:00:00
969722
MS-DARE
.35
11/01 01:00:00
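How can I account for all of these variations so that I always get 4 groups, where group 3 holds an amount such as 3.00 or .35 when one is present and is empty otherwise?
Update: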
https://regex101.com/r/lL8rIj/1/
I am getting close here, but when there is no amount I still need an empty group 3 for each match.
Answer 0 (score: 1)
You can try this:
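import re
s = ['523421 F-INV PROC 11/01 01:00:00', '634312 MA-BAREAUTH 11/01 01:00:00', '523421 MK-PERM YEAR 11/01 01:00:00', '123512 G5-FSB 3.00 11/01 01:00:00', '864982 JA-PAREN 4.25* 11/01 01:00:00', '934821 4.00 11/01 01:00:00', '620021 I-MAS DIN 5.25* 11/01 01:00:00', '969722 MS-DARE .35 11/01 01:00:00']
# Split on whitespace that is followed by, or preceded by, a digit or a non-word character.
# Note: the space between the date and the time follows a digit, so they end up as separate items.
final_s = [re.split(r'\s(?=[\d\W])|(?<=[\d\W])\s', i) for i in s]
print(final_s)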
Answer 1 (score: 1)
It seems you could use
^ # start of line
(?P<group1>\d+)\s # capture numbers, match whitespace
(?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)? # capture as long as the formats
# of group 3 and 4 are not met
# the group is optional
(?P<group3>\d*\.\d+\*?)?\s+ # format of group 3...
(?P<group4>\d+/\d+.+) # ... and 4 respectively
$ # end of line
In Python and pandas, this would be:
import re, pandas as pd
string = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE AUT .35 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
"""
rx = re.compile(r'''
^
(?P<group1>\d+)\s
(?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)?
(?P<group3>\d*\.\d+\*?)?\s+
(?P<group4>\d+/\d+.+)
$''', re.VERBOSE | re.MULTILINE)
records = ((m.group(1), m.group(2).rstrip() if m.group(2) else None,
m.group(3), m.group(4))
for m in rx.finditer(string))
df = pd.DataFrame(records)
print(df)
This yields
        0             1      2               3
0  864982      JA-PAREN  4.25*  11/01 01:00:00
1  934821          None   4.00  11/01 01:00:00
2  620021     I-MAS DIN  5.25*  11/01 01:00:00
3  969722   MS-DARE AUT    .35  11/01 01:00:00
4  523421    F-INV PROC   None  11/01 01:00:00
5  634312   MA-BAREAUTH   None  11/01 01:00:00
6  523421  MK-PERM YEAR   None  11/01 01:00:00
7  123512        G5-FSB   3.00  11/01 01:00:00
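If empty strings are preferred over None for the missing groups (for instance before writing the rows to MySQL), the same pattern can be wrapped in a small per-line helper. This is only a sketch: `RX` and `split_line` are hypothetical names, and the choice to return None for a line that does not match at all is an assumption.

```python
import re

# Same pattern as in this answer, compiled once for per-line matching.
RX = re.compile(r'''
    ^
    (?P<group1>\d+)\s
    (?P<group2>(?:(?!\d*\.\d+|\d{2}/).)+)?
    (?P<group3>\d*\.\d+\*?)?\s+
    (?P<group4>\d+/\d+.+)
    $''', re.VERBOSE)

def split_line(line):
    """Return four strings for a matching line; absent groups become ''."""
    m = RX.match(line.strip())
    if m is None:
        return None  # assumption: callers decide how to handle unparsable lines
    # Optional groups that did not participate are None; map them to ''
    # and strip the trailing space that group2 can pick up.
    return tuple((g or '').strip() for g in m.groups())

print(split_line('934821 4.00 11/01 01:00:00'))
print(split_line('523421 F-INV PROC 11/01 01:00:00'))
```

Each matching line then always produces a 4-tuple, so the rows can be passed to pd.DataFrame with four fixed columns.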
Answer 2 (score: 1)
I would like to propose the following solution:
import re
data = """
864982 JA-PAREN 4.25* 11/01 01:00:00
934821 4.00 11/01 01:00:00
620021 I-MAS DIN 5.25* 11/01 01:00:00
969722 MS-DARE AUT .35 11/01 01:00:00
969722 MS-DARE 99/99 AUT .35 11/01 01:00:00
523421 F-INV PROC 11/01 01:00:00
634312 MA-BAREAUTH 11/01 01:00:00
523421 MK-PERM YEAR 11/01 01:00:00
523421 MK-PERM 3. YEAR 11/01 01:00:00
123512 G5-FSB 3.00 11/01 01:00:00
"""
rx = re.compile(r"""
^
(\d+)
(?:
\s([a-z].*[a-z])
\s(\d?\.\d+)\*?\s
|(?:
\s([a-z].*[a-z])(\s)
|(\s)(\d*\.\d+)\*?\s
)
)
(\d\d(?:[/\s:]\d\d){4})
$
""", re.I | re.M | re.X)
for m in rx.finditer(data):
print(tuple(e for e in m.groups() if e))
Result:
('864982', 'JA-PAREN', '4.25', '11/01 01:00:00')
('934821', ' ', '4.00', '11/01 01:00:00')
('620021', 'I-MAS DIN', '5.25', '11/01 01:00:00')
('969722', 'MS-DARE AUT', '.35', '11/01 01:00:00')
('969722', 'MS-DARE 99/99 AUT', '.35', '11/01 01:00:00')
('523421', 'F-INV PROC', ' ', '11/01 01:00:00')
('634312', 'MA-BAREAUTH', ' ', '11/01 01:00:00')
('523421', 'MK-PERM YEAR', ' ', '11/01 01:00:00')
('523421', 'MK-PERM 3. YEAR', ' ', '11/01 01:00:00')
('123512', 'G5-FSB', '3.00', '11/01 01:00:00')
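In the result above, a missing name or amount shows up as a single space rather than an empty value. A minimal sketch of normalizing such tuples follows; `rows` just repeats two of the printed tuples and `normalize` is a hypothetical helper.

```python
# Two of the tuples printed above, used here as sample input.
rows = [
    ('934821', ' ', '4.00', '11/01 01:00:00'),
    ('523421', 'F-INV PROC', ' ', '11/01 01:00:00'),
]

def normalize(row):
    """Strip each field so whitespace-only placeholders become ''."""
    return tuple(field.strip() for field in row)

cleaned = [normalize(row) for row in rows]
print(cleaned)
```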