My input dataset is as follows:
INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
I would like the output in the following format:
OUTPUT = [
('ABCD, Jun/14/1999'),
('EFGH, Jan/10/1998'),
('IJKL, Jul/15/1985'),
('MNOP, Dec/21/1999'),
('QRST, Apr/1/2000'),
('UVWX, Feb/11/2001')
]
I tried the code below, which partially works, but I can't get the result into the required OUTPUT format:
import re
INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
def formatted_def(input):
    for n in input:
        t = re.sub('[^a-zA-Z0-9 ]+','',n).split('DOB')
        print(t)
formatted_def(INPUT)
Output:
['ABCD ', ' Jun141999']
['EFGH ', ' Jan101998']
['IJKL ', ' Jul151985']
['MNOP ', ' Dec211999']
['QRST ', ' Apr012000']
['UVWX D O B Feb112001 ']
Any pointers would be very helpful. Thanks in advance!
Answer 0 (score: 2)
You can use re.findall:
import re
l = ['ABCD , D.O.B: - Jun/14/1999.', 'EFGH , DOB; - Jan/10/1998,', 'IJKL , D-O-B - Jul/15/1985..', 'MNOP , (DOB)* - Dec/21/1999,', 'QRST , *DOB* - Apr/01/2000.', 'UVWX , D O B, - Feb/11/2001 ']
# Raw string avoids invalid-escape warnings for \w and \d in newer Python versions
final_data = [', '.join(re.findall(r'^\w+|[a-zA-Z]+/\d+/\d+(?=\W)', i)) for i in l]
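To see what the two alternatives in that pattern pick out, here is a small check on a single input string. The names `pattern`, `sample`, and `no_trailing` are illustrative and not part of the answer. It also shows one caveat: because of the `(?=\W)` lookahead, a date that sits at the very end of the string, with no trailing character, is not matched.

```python
import re

pattern = r'^\w+|[a-zA-Z]+/\d+/\d+(?=\W)'

sample = 'ABCD , D.O.B: - Jun/14/1999.'
# First alternative: the leading run of word characters (the name).
# Second alternative: letters/digits/digits followed by a non-word character.
parts = re.findall(pattern, sample)
print(parts)             # ['ABCD', 'Jun/14/1999']
print(', '.join(parts))  # ABCD, Jun/14/1999

# Caveat: with nothing after the year, the lookahead cannot succeed,
# so only the name is found.
no_trailing = re.findall(pattern, 'ABCD , DOB - Jun/14/1999')
print(no_trailing)       # ['ABCD']
```

Every string in the question's INPUT has at least one character after the year, so the caveat does not affect that data.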
Answer 1 (score: 2)
In addition to the other answers, you can also use re.sub:
import re

INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
pattern = r'(?i)^([a-z]+).*([a-z]{3}/\d{2}/\d{4}).*$'
OUTPUT = [re.sub(pattern, r'\1, \2', x) for x in INPUT]
# OUTPUT:
[
'ABCD, Jun/14/1999',
'EFGH, Jan/10/1998',
'IJKL, Jul/15/1985',
'MNOP, Dec/21/1999',
'QRST, Apr/01/2000',
'UVWX, Feb/11/2001'
]
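The desired output in the question shows 'Apr/1/2000', without a leading zero on the day, while the substitution above keeps 'Apr/01/2000'. It is not clear whether that difference is intended or a typo in the question. If it is intended, one possible sketch is to capture the day separately and pass it through `int()`. The helper name `normalize` and the adjusted pattern are assumptions made for illustration, not part of the answer.

```python
import re

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 ',
]

# Name at the start, then the first Mon/dd/yyyy date, with month, day, year captured separately.
pattern = re.compile(r'^([a-z]+).*?([a-z]{3})/(\d{1,2})/(\d{4})', re.IGNORECASE)

def normalize(line):
    """Return 'NAME, Mon/d/yyyy' with any leading zero removed from the day."""
    m = pattern.search(line)
    if m is None:
        return line  # leave lines that do not match unchanged
    name, month, day, year = m.groups()
    return '{}, {}/{}/{}'.format(name, month, int(day), year)

OUTPUT = [normalize(x) for x in INPUT]
print(OUTPUT)
```

With this variant the fifth entry becomes 'QRST, Apr/1/2000', and the other entries are the same as in the re.sub result.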
Answer 2 (score: 2)
import re
re.findall(r'(\w+)\s+,.*?-\s+([^., ]*)', ' '.join(INPUT))
# [('ABCD', 'Jun/14/1999'), ('EFGH', 'Jan/10/1998'), ('IJKL', 'Jul/15/1985'), ('MNOP', 'Dec/21/1999'), ('QRST', 'Apr/01/2000'), ('UVWX', 'Feb/11/2001')]
Answer 3 (score: 0)
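The main difficulty is producing elements that look like ('ABCD, Jun/14/1999'),.
Such an element cannot be a one-element tuple, because a one-element tuple would be printed as ('ABCD, Jun/14/1999',), (note the extra , before the closing )).
So, to get exactly the result you want, I use a sequence of print statements.
The whole script (in Python 3) can look like this: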
import re
input = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
result = [ re.sub(r'^([a-z]+).*? - ([a-z]{3}/\d{2}/\d{4}).*',
    r'\1, \2', txt, flags = re.IGNORECASE) for txt in input ]
print('OUTPUT = [')
for txt in result:
    print("  ('{}')".format(txt))
print(']')
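The point about one-element tuples can be checked directly. In this small illustration (the names `s` and `t` are arbitrary), parentheses around a single string only group the expression, while a trailing comma is what creates a tuple:

```python
s = ('ABCD, Jun/14/1999')   # parentheses only group: this is a plain str
t = ('ABCD, Jun/14/1999',)  # the trailing comma makes a one-element tuple

print(type(s).__name__, repr(s))  # str 'ABCD, Jun/14/1999'
print(type(t).__name__, repr(t))  # tuple ('ABCD, Jun/14/1999',)
```

This is why a list written as in the question's OUTPUT is, in practice, just a list of strings, and why the parentheses have to be printed by hand to reproduce that exact layout.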