Select rows based on a list of start and end dates

Time: 2019-09-22 15:40:59

Tags: python-3.x

DateTime, units
2019-04-04 13:44:48, 15
2019-04-05 13:44:49, 95
2019-04-06 13:44:50, 16
2019-04-07 13:44:51, 23
2019-04-09 13:44:53, 17
2019-04-10 13:44:53, 54
2019-04-11 13:44:53, 14
2019-04-12 13:44:53, 53
2019-04-13 13:44:53, 82
2019-04-14 13:44:53, 25
2019-04-15 13:44:53, 66
2019-04-16 13:44:53, 2
2019-04-17 13:44:53, 44
2019-04-18 13:44:53, 85
2019-04-19 13:44:53, 28
2019-04-20 13:44:53, 20
2019-04-21 13:44:53, 99
2019-04-22 13:44:53, 41
2019-04-23 13:44:53, 3
2019-04-24 13:44:53, 36
2019-04-25 13:44:53, 26
2019-04-26 13:44:53, 30

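I have a large CSV file (> 5 GB) and a list of start and end dates. I want to select the rows of the data frame that fall within those start/end date ranges. The ranges do not overlap.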

For the sample above, the list of start and end dates would be:

Start, End
2019-04-01 00:00:00, 2019-04-06 00:00:00
2019-04-09 00:00:00, 2019-04-11 00:00:00
2019-04-18 00:00:00, 2019-04-21 00:00:00

I can do this with a for loop, but I would prefer a more efficient approach if one exists.

1 Answer:

Answer 0: (score: 0)

I suggest using pandas. First, you need to load the data:

import pandas as pd
import datetime as dt

df = pd.read_csv("path-to-your-csv-file/yourfile.csv") # read your file into df

start = "2019-04-07 00:00:00" # example start date string
end = "2019-04-11 00:00:00" # example end date string

def to_datetime(x):
    # parse a "YYYY-MM-DD HH:MM:SS" string into a datetime object
    return dt.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")

df['DateTime'] = df.DateTime.apply(to_datetime) # convert the column entries from strings to datetime objects

# convert the start and end date strings to datetime objects
start = to_datetime(start)
end = to_datetime(end)
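As a possible alternative to the row-by-row strptime conversion above, pandas can parse the column while reading the file. The sketch below assumes the header is exactly DateTime,units with no space after the comma, and uses an in-memory string (csv_text, a hypothetical stand-in) instead of the real file:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the CSV file (first rows of the sample).
csv_text = """DateTime,units
2019-04-04 13:44:48,15
2019-04-05 13:44:49,95
2019-04-06 13:44:50,16
"""

# parse_dates converts the named column to datetime64 while reading,
# which avoids calling strptime once per row.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateTime"])

# pd.to_datetime also handles single strings and returns a Timestamp.
start = pd.to_datetime("2019-04-07 00:00:00")
end = pd.to_datetime("2019-04-11 00:00:00")
```

On a multi-gigabyte file this kind of vectorized parsing is usually much faster than apply, although the actual gain depends on the data and the pandas version.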

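You will need the to_datetime function in almost every case, because having the DateTime column hold datetime objects is very convenient. In the simplest case you do not care about the time of day, and we assume that both the start date and the end date actually occur in the column: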

df.DateTime = df.DateTime.dt.date # drop the time of day, keeping only the date
start_index = df[df.DateTime == start.date()].index[0] # index of the first row where DateTime == start
end_index = df[df.DateTime == end.date()].index[-1] # index of the last row where DateTime == end

target = df.loc[start_index:end_index] # label-based slice, inclusive of end_index; the subset matching your criteria
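The same date-only selection can also be written as a boolean mask, which does not require the start or end date to appear in the data. This is a minimal sketch on a few rows of the sample; the variable names days and target are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2019-04-06 13:44:50", "2019-04-07 13:44:51", "2019-04-09 13:44:53",
        "2019-04-10 13:44:53", "2019-04-11 13:44:53", "2019-04-12 13:44:53",
    ]),
    "units": [16, 23, 17, 54, 14, 53],
})
start = pd.Timestamp("2019-04-07")
end = pd.Timestamp("2019-04-11")

# normalize() resets every timestamp to midnight, so only the day is compared.
days = df["DateTime"].dt.normalize()

# between() is inclusive on both ends by default.
target = df[days.between(start, end)]
```

Unlike the index lookup, the mask keeps the original DateTime column intact and does not depend on the rows being sorted.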

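If the dates do not occur in the data (e.g., because you want to include a time of day but cannot specify it exactly), you can use searchsorted to get the indices: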

start_index = df.DateTime.searchsorted(start) # position of the first row with DateTime >= start (a scalar in pandas >= 0.24)
end_index = df.DateTime.searchsorted(end, side="right") # position just past the last row with DateTime <= end

target = df.iloc[start_index:end_index] # positional slice (end_index is exclusive); the subset matching your criteria

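Be careful when using searchsorted, though: remember to skip the step that strips the timestamps, and make sure the DateTime column is sorted. Finally, keep in mind that the time of day matters: in the sample, "2019-04-07 23:55:00" lands closer to the "2019-04-09 13:00:00" row, while "2019-04-07 10:55:00" lands closer to the "2019-04-07 13:00:00" row.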
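The question asks about a whole list of ranges rather than a single pair. One way to extend the searchsorted idea is to look up all start and end values at once. This is a rough sketch, assuming the DateTime column is sorted and the ranges do not overlap; the ranges frame and the lo, hi, and positions names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2019-04-04 13:44:48", "2019-04-05 13:44:49", "2019-04-06 13:44:50",
        "2019-04-07 13:44:51", "2019-04-09 13:44:53", "2019-04-10 13:44:53",
        "2019-04-11 13:44:53", "2019-04-17 13:44:53", "2019-04-18 13:44:53",
        "2019-04-19 13:44:53", "2019-04-20 13:44:53", "2019-04-21 13:44:53",
    ]),
    "units": [15, 95, 16, 23, 17, 54, 14, 44, 85, 28, 20, 99],
})
ranges = pd.DataFrame({
    "Start": pd.to_datetime(["2019-04-01", "2019-04-09", "2019-04-18"]),
    "End": pd.to_datetime(["2019-04-06", "2019-04-11", "2019-04-21"]),
})

times = df["DateTime"].to_numpy()  # must already be sorted ascending

# One binary search per boundary, done for all ranges in a single call.
lo = np.searchsorted(times, ranges["Start"].to_numpy(), side="left")
hi = np.searchsorted(times, ranges["End"].to_numpy(), side="right")

# Expand each [lo, hi) pair into row positions and select them in one go.
positions = np.concatenate([np.arange(a, b) for a, b in zip(lo, hi)])
target = df.iloc[positions]
```

The remaining Python-level loop runs once per range, not once per row, so it stays cheap even when the frame is large. For a file that does not fit in memory, the same lookup could presumably be applied to each chunk returned by read_csv with the chunksize argument.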