How can I filter a csv file based on a datetime?

Time: 2020-10-01 07:13:54

Tags: python csv datetime

I have a csv file that looks like this (the real one is obviously much larger):

1,$1,AA,GG,DD,2020-01-01T00:01:10.740+02:00
2,$2,A1,FD,HH,2020-01-01T00:02:00.240+02:00
3,$3,1A,PP,LL,2020-01-01T00:03:30.460+02:00
4,$4,S1,LL,SS,2020-02-01T00:01:11.190+02:00
5,$5,2G,PP,FF,2020-01-01T00:04:20.320+02:00
6,$6,5S,LL,TT,2020-02-01T01:02:15.180+02:00

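I need to read the first row, note its date, and check whether each remaining row falls on that same day, between 00:00:00.000 and 23:59:59.999. Put simply: I want to keep every row whose date matches the date of the first row.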

This is the result I want:

1,$1,AA,GG,DD,2020-01-01T00:01:10.740+02:00
2,$2,A1,FD,HH,2020-01-01T00:02:00.240+02:00
3,$3,1A,PP,LL,2020-01-01T00:03:30.460+02:00
5,$5,2G,PP,FF,2020-01-01T00:04:20.320+02:00

This is my code:

root = r'c:\data\FF\Desktop\my_files\file01.txt'

with open(root, 'r') as my_file:
    reader = csv.reader(my_file)
        
def filter_row():
    for row in reader:
        date_time = row[5]   #<--- extract the datetime 
        fdate_time = datetime.strptime(date_time, '%Y-%m-%dT%H:%M:%S.%f%z') #<--- make a datetime object of it
        x = fdate_time.date() #<--- extract the y/m/d

        begin_time = datetime.strptime(x + '00:00.00+02:00','%Y-%m-%dT%H:%M:%S.%f%z') #<--- fix the start time of a day
        end_time = datetime.strptime(x + '23:59:59.999+02:00', '%Y-%m-%dT%H:%M:%S.%f%z') #<--- fix the end time of a day
        
        filtered_records = fdate_time >= begin_time and fdate_time <= end_time #<filter everything between the start and end time
        
    return filtered_records
        
filter_row() 
 

When I run the code above, I get:

  File "C:\data\FF\Desktop\Python\My_python\Filter_csv.py", line 82, in filter_row
    for row in reader:

ValueError: I/O operation on closed file.

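I'm really lost here, because I don't know how to fix this. I have looked for solutions but could not find one. I hope someone can show me a fix and explain how it works. Thanks, everyone.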

2 answers:

Answer 0: (score: 1)

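The context manager provided by with guarantees that the resource is released at the end of the block. That means everything must be read from the file inside the with block.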
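To make that concrete, here is a minimal sketch that reproduces the error from the question. The temporary file and the names path and error are assumptions made only for this illustration; they are not part of the original code.

```python
import csv
import os
import tempfile

# Throwaway one-row file so the sketch is self-contained (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('1,$1,AA,GG,DD,2020-01-01T00:01:10.740+02:00\n')
    path = tmp.name

with open(path, newline='') as my_file:
    reader = csv.reader(my_file)   # the reader only wraps the open file

# The with block has ended here, so my_file is already closed.
try:
    next(reader)                   # reading now fails
    error = None
except ValueError as exc:
    error = str(exc)

os.remove(path)
print(error)
```

The reader is lazy: it only pulls lines from the file when it is iterated, so any iteration has to happen while the file is still open.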

A simple way is to pass the reader to the function as a parameter:

root = r'c:\data\FF\Desktop\my_files\file01.txt'

def filter_row(reader):
    for row in reader:
        ...            
    return filtered_records

with open(root, 'r') as my_file:
    reader = csv.reader(my_file)
    filter_row(reader)

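However:

  • You should use the datetime.replace method to compute the start and end of the day instead of building strings
  • If you want to write those rows to a new file, you should turn filter_row into a generator:
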
import csv
from datetime import datetime

root = r'c:\data\FF\Desktop\my_files\file01.txt'
newf = r'c:\data\FF\Desktop\my_files\file01.csv'

def filter_row(reader):
    first = True
    for row in reader:
        date_time = row[5]  # <--- extract the datetime
        fdate_time = datetime.strptime(date_time, '%Y-%m-%dT%H:%M:%S.%f%z')  # <--- make a datetime object of it

        if first:         # special processing for the first line
            first = False
            begin_time = fdate_time.replace(hour=0, minute=0, second=0, microsecond=0) # <--- fix the start time of a day
            end_time = fdate_time.replace(hour=23, minute=59, second=59, microsecond=999999) # <--- fix the end time of a day
            yield row      # yield first row
        elif fdate_time >= begin_time and fdate_time <= end_time:  # <filter everything between the start and end time
            yield row      # and rows of same date

# the csv module expects its files to be opened with newline=''
with open(root, newline='') as my_file, open(newf, 'w', newline='') as new_file:
    reader = csv.reader(my_file)
    writer = csv.writer(new_file)

    writer.writerows(filter_row(reader))
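As a quick check of the idea, the sketch below runs a simplified variant of the generator on the sample rows from the question, reading from an in-memory string instead of a file. It compares calendar dates with .date() rather than a begin/end range, which should be equivalent as long as every row carries the same UTC offset (an assumption here). The names SAMPLE, FMT and filter_same_day are hypothetical.

```python
import csv
import io
from datetime import datetime

# Sample rows copied from the question.
SAMPLE = """\
1,$1,AA,GG,DD,2020-01-01T00:01:10.740+02:00
2,$2,A1,FD,HH,2020-01-01T00:02:00.240+02:00
3,$3,1A,PP,LL,2020-01-01T00:03:30.460+02:00
4,$4,S1,LL,SS,2020-02-01T00:01:11.190+02:00
5,$5,2G,PP,FF,2020-01-01T00:04:20.320+02:00
6,$6,5S,LL,TT,2020-02-01T01:02:15.180+02:00
"""

FMT = '%Y-%m-%dT%H:%M:%S.%f%z'   # needs Python 3.7+ for the "+02:00" offset form

def filter_same_day(reader):
    """Yield every row whose date equals the date of the first row."""
    first_day = None
    for row in reader:
        day = datetime.strptime(row[5], FMT).date()
        if first_day is None:
            first_day = day      # remember the date of the first row
        if day == first_day:
            yield row

kept = list(filter_same_day(csv.reader(io.StringIO(SAMPLE))))
print([row[0] for row in kept])  # ['1', '2', '3', '5']
```

Rows 4 and 6 are dropped because they are dated 2020-02-01, which matches the result asked for in the question.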

Answer 1: (score: 0)

I suggest you use pandas for this.

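  1. Read the file into a DataFrame with pandas
  2. Then restrict the rows to the date of the first value (filter the records and put them in another DataFrame)
  3. The new DataFrame will hold the desired output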

pandas will also make it easier to scale if the file grows in the future.
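The answer gives no code, so the following is only one possible reading of those three steps, not the answerer's own implementation. It assumes the file has no header row, that the timestamp sits in the sixth column, and that all rows share the same UTC offset (with mixed offsets the parsed column may not support the .dt accessor). SAMPLE and the variable names are hypothetical.

```python
import io

import pandas as pd

# Sample rows copied from the question, read from memory instead of a file.
SAMPLE = """\
1,$1,AA,GG,DD,2020-01-01T00:01:10.740+02:00
2,$2,A1,FD,HH,2020-01-01T00:02:00.240+02:00
3,$3,1A,PP,LL,2020-01-01T00:03:30.460+02:00
4,$4,S1,LL,SS,2020-02-01T00:01:11.190+02:00
5,$5,2G,PP,FF,2020-01-01T00:04:20.320+02:00
6,$6,5S,LL,TT,2020-02-01T01:02:15.180+02:00
"""

# Step 1: read the file into a DataFrame (columns are numbered 0..5).
df = pd.read_csv(io.StringIO(SAMPLE), header=None)

# Step 2: parse the timestamps and keep rows on the first row's date.
timestamps = pd.to_datetime(df[5])
first_day = timestamps.iloc[0].date()
same_day = df[timestamps.dt.date == first_day]

# Step 3: the new DataFrame holds the filtered rows.
print(same_day[0].tolist())
```

To write the result back out, same_day.to_csv(path, header=False, index=False) should produce a file in the same layout as the input.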