Python: fixing dates with regular expressions

Asked: 2019-02-04 12:02:12

Tags: regex python-3.x

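I have a "date" column in Excel that different people filled in over many years, and I want to build proper start-date and end-date columns in a pandas DataFrame from this "dirty" column. The dirty column is full of junk and whole sentences, so extracting the day, month, and year is tricky. I have managed to clean it up a little, but there is still a long way to go. Here is a sample of the "dirty date" column: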

['Apr 1 - 10 2012', 'Aug 6 - sept 17 2018',
 'Jan 2017 (explosion) - dec 31', 'April 15 - Nov 20',
 "Mar 1 - Jun 30 '12", "Sep 27 - Nov 30, 2012",
 "Dec 7, 2015 - Feb 15, '16", "June 21 2016 - June 27, 2016",
 "July 6 - 13 (est), 2016", "Mar 28 2016 - dec 31 2017",
 "Nov 30, 2016 - Aug 31, 2016", "1 oct - 15 oct 2017",
 "March 26 to May 3, 2017", "Jan 4,2018",
 "Jan 26, 2017 - end of march 2018",
 "5 days assumed for Sep '11",
 "Sep 15 - Oct 12 '11 (abc 50 n/g) something in 2015 but something else happened '16",
 "Mar 17 - Apr 15, '12 (assumed 1 mo.)",
 "Jan 1 - 31 '12 a/b due to many words here and descriptions."]

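I have more than 30,000 rows. For the start date, I first tried splitting each value on '-', since they all seem to contain either 'to' or '-', and did the following, but it doesn't work well and I can't cover all the cases above: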

import re

def start_date(row):
    # Only strings can be cleaned; skip None, NaN (float) and datetime values.
    if not isinstance(row, str):
        return ''
    # Trying to skip long rows with many words, but I don't think this is the way to go.
    if len(row.split(' ')) > 7:
        return ''
    # Normalise quotes and common typos so a short year looks like '12.
    row = row.replace('"', "'")
    row = row.replace('!', '1')
    row = row.replace('`', '')
    row = row.replace("' ", "'")
    row = row.replace(', ', ' ')
    # Two digits right after an apostrophe are the short year.
    match = re.search(r"(?<=')\d\d", row)
    if match is None:
        return ''
    year_start = match[0]
    # Treat the standalone word "to" as a separator, leaving words like "October" alone.
    row = re.sub(r'\bto\b', '-', row)

    # Everything before the first separator is the start of the range.
    row_start = row.split('-')[0]
    row_start = row_start.replace(" '", ',')
    row_start = row_start.replace(',,', ',')
    row_start = row_start.strip()

    return row_start + ', 20' + year_start
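
The splitting step on its own could be made a little more robust with a regex separator instead of chained replace calls. The following is only a minimal sketch under a few assumptions: parenthesised remarks such as "(est)" never contain date information, the first '-' or standalone "to" is the range separator, and the names PAREN_NOTE, SEPARATOR and split_range are hypothetical helpers, not part of any library.

```python
import re

# Parenthesised remarks such as "(explosion)" or "(est)" are assumed to carry no date.
PAREN_NOTE = re.compile(r"\([^)]*\)")
# A range separator: a hyphen, or the standalone word "to" (so "October" is untouched).
SEPARATOR = re.compile(r"\s*(?:-|\bto\b)\s*", re.IGNORECASE)


def split_range(text):
    """Return (start_text, end_text); end_text is '' when no separator is found."""
    cleaned = PAREN_NOTE.sub(" ", text)
    # Split only once, so hyphens or "to" later in a long sentence are ignored.
    parts = SEPARATOR.split(cleaned, maxsplit=1)
    start = " ".join(parts[0].split())
    end = " ".join(parts[1].split()) if len(parts) > 1 else ""
    return start, end


if __name__ == "__main__":
    for sample in ["Apr 1 - 10 2012", "March 26 to May 3, 2017", "Jan 4,2018"]:
        print(sample, "->", split_range(sample))
```

Rows that are mostly free text (for example "5 days assumed for Sep '11") have no separator at all, so they come back with an empty end part and would still need separate handling.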

Thanks for your help.

If anyone knows a suitable regular expression, even one that covers only the edge cases above would help a lot.
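
To make the request concrete, here is a rough sketch of the kind of regex I have in mind: one pattern each for month names, years (four digits or an apostrophe plus two digits) and day numbers, applied to one side of a range at a time. It rests on assumptions that may not hold for all 30,000 rows: two-digit years belong to the 2000s, the first number from 1 to 31 is the day, and a side that lacks a year or month can borrow it from the other side. The names MONTH_RE, YEAR_RE, DAY_RE, parse_fragment and fill_missing are made up for illustration.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Abbreviated or full English month names, matched as whole words.
MONTH_RE = re.compile(
    r"\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sept?(?:ember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b",
    re.IGNORECASE,
)
# Four-digit year starting with 19 or 20, or an apostrophe plus two digits ('12).
YEAR_RE = re.compile(r"\b((?:19|20)\d\d)\b|'(\d\d)\b")
# One or two digits standing alone, not part of a year and not after an apostrophe.
DAY_RE = re.compile(r"(?<!['\d])\b(\d{1,2})\b")


def parse_fragment(text):
    """Return (year, month, day) for one side of a range; missing parts are None."""
    month_match = MONTH_RE.search(text)
    year_match = YEAR_RE.search(text)
    day_match = DAY_RE.search(text)
    month = MONTHS[month_match.group(1)[:3].lower()] if month_match else None
    year = None
    if year_match:
        if year_match.group(1):
            year = int(year_match.group(1))
        else:
            # Assumption: two-digit years such as '12 mean 2012, never 1912.
            year = 2000 + int(year_match.group(2))
    day = int(day_match.group(1)) if day_match else None
    if day is not None and not 1 <= day <= 31:
        day = None  # numbers like 50 are not days
    return year, month, day


def fill_missing(start, end):
    """Let each side inherit a missing year (and the end a missing month) from the other."""
    s_year, s_month, s_day = start
    e_year, e_month, e_day = end
    return ((s_year or e_year, s_month, s_day),
            (e_year or s_year, e_month or s_month, e_day))


if __name__ == "__main__":
    print(fill_missing(parse_fragment("Apr 1"), parse_fragment("10 2012")))
```

This takes the first match of each kind, so a sentence like "5 days assumed for Sep '11" would wrongly treat 5 as the day; rows like that probably need to be flagged for manual review rather than parsed.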

0 answers:

No answers