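I have a "dates" column in Excel that was filled in by different people over many years, and I want to build proper start-date and end-date columns in a pandas DataFrame from this "dirty" column. The column is full of junk and free-text sentences, so extracting the day/month/year is tricky. I have managed to clean it up a little, but I'm still a long way from done. Here is a sample of the dirty dates column: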
['Apr 1 - 10 2012',
 'Aug 6 - sept 17 2018',
 'Jan 2017 (explosion) - dec 31',
 'April 15 - Nov 20',
 "Mar 1 - Jun 30 '12",
 "Sep 27 - Nov 30, 2012",
 "Dec 7, 2015 - Feb 15, '16",
 "June 21 2016 - June 27, 2016",
 "July 6 - 13 (est), 2016",
 "Mar 28 2016 - dec 31 2017",
 "Nov 30, 2016 - Aug 31, 2016",
 "1 oct - 15 oct 2017",
 "March 26 to May 3, 2017",
 "Jan 4,2018",
 "Jan 26, 2017 - end of march 2018",
 "5 days assumed for Sep '11",
 "Sep 15 - Oct 12 '11 (abc 50 n/g) something in 2015 but something else happened '16",
 "Mar 17 - Apr 15, '12 (assumed 1 mo.)",
 "Jan 1 - 31 '12 a/b due to many words here and descriptions."]
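I have more than 30,000 rows. For the start date, I tried splitting on '-' first, since every row seems to contain either 'to' or '-', and then did the following, but it doesn't work well and I can't cover all of the cases above: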
import re

def start_date(row):
    """Return the start of a range such as "Mar 1 - Jun 30 '12" as "Mar 1, 2012"."""
    # Skip non-strings (None, NaN floats, datetimes) and long rows with many
    # words, although I don't think skipping long rows is the way to go.
    if not isinstance(row, str) or len(row.split(' ')) > 7:
        return ''
    # Only rows that carry a quoted two-digit year are handled here.
    if "'" not in row and '"' not in row:
        return ''
    row = row.replace('"', "'")
    row = row.replace('!', "1")
    row = row.replace("`", "")
    row = row.replace("' ", "'")
    row = row.replace(", ", " ")
    # Two-digit year right after an apostrophe, e.g. '12
    year_match = re.search(r"(?<=')(\d\d)", row)
    if year_match is None:
        return ''
    # Replace "to" only as a whole word, so "October" is left intact.
    row = re.sub(r"\bto\b", "-", row)
    row_start = row.split('-')[0]
    row_start = row_start.replace(" '", ',')
    row_start = row_start.replace(",,", ',')
    row_start = row_start.strip()
    return row_start + ', 20' + year_match[0]
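For illustration, the split-on-separator idea described above could be sketched along the following lines. This is only a rough sketch under several assumptions, not a complete solution: the names `parse_range`, `_parts` and `_make` are made up for this example, two-digit years are assumed to mean 20xx, a missing day is assumed to be the 1st, the start inherits the year from the end, and the end inherits the month from the start. It uses only the standard library.

```python
import re
from datetime import date

# Month prefixes mapped to month numbers (jan -> 1, ..., dec -> 12).
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")[a-z]*\b", re.I)
# Either a four-digit year (19xx/20xx) or an apostrophe plus two digits ('12).
YEAR_RE = re.compile(r"(?<!\d)(?:((?:19|20)\d{2})|'\s?(\d{2}))(?!\d)")
DAY_RE = re.compile(r"(?<!\d)(\d{1,2})(?!\d)")
# Range separator: a hyphen, or the word "to" surrounded by whitespace.
SEP_RE = re.compile(r"\s*-\s*|\s+to\s+", re.I)


def _parts(chunk):
    """Pull (year, month, day) out of one side of a range; each may be None."""
    year = month = day = None
    m = YEAR_RE.search(chunk)
    if m:
        # Assumption: a two-digit year such as '12 means 2012.
        year = int(m.group(1)) if m.group(1) else 2000 + int(m.group(2))
        # Cut the year out so its digits are not mistaken for a day.
        chunk = chunk[:m.start()] + " " + chunk[m.end():]
    m = MONTH_RE.search(chunk)
    if m:
        month = MONTHS[m.group(1).lower()]
    m = DAY_RE.search(chunk)
    if m:
        day = int(m.group(1))
    return year, month, day


def _make(year, month, day):
    """Build a date, or None when the year/month is missing or invalid."""
    if year is None or month is None:
        return None
    try:
        # Assumption: a missing day defaults to the 1st of the month.
        return date(year, month, day if day is not None else 1)
    except ValueError:
        return None


def parse_range(text):
    """Return (start, end) dates for one dirty cell, or None where unknown."""
    if not isinstance(text, str):
        return None, None
    cleaned = re.sub(r"\([^)]*\)", " ", text)  # drop notes like "(est)"
    pieces = SEP_RE.split(cleaned, maxsplit=1)
    sy, sm, sd = _parts(pieces[0])
    ey, em, ed = _parts(pieces[1]) if len(pieces) > 1 else (sy, sm, sd)
    sy = sy or ey   # "Mar 1 - Jun 30 '12": start borrows the end's year
    ey = ey or sy   # "Jan 2017 - dec 31": end borrows the start's year
    em = em or sm   # "Apr 1 - 10 2012": end borrows the start's month
    return _make(sy, sm, sd), _make(ey, em, ed)


if __name__ == "__main__":
    for cell in ["Apr 1 - 10 2012", "Mar 1 - Jun 30 '12",
                 "July 6 - 13 (est), 2016", "Jan 4,2018"]:
        print(cell, "->", parse_range(cell))
```

The sketch has clear limits. Rows with no year at all, such as 'April 15 - Nov 20', come back as `(None, None)`. Free-text rows such as "5 days assumed for Sep '11" will pick up the wrong number as the day. Ordinary words that start with a month prefix (for example "maybe" or "market") will be read as months, and a hyphen earlier in the text will be treated as the range separator. It also does not check that the start is before the end, so "Nov 30, 2016 - Aug 31, 2016" is returned as written. On a DataFrame the function could be applied row by row with `Series.apply`, and rows that return `None` could be set aside for manual review.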
Thanks for your help.
If anyone knows a suitable regular expression, advice on how to cover just the edge cases above would also be a great help.