Question

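I have a file, and I would like to append some of its rows to an empty list if two conditions are met:

I only select the rows whose country_code also exists in my_countrycodes, and
for each country_code, I only want the row with the latest datetime, provided that datetime is < my_time1.

Note that the country_code of each row is at index [1] of the row, and the datetime of each row is a variable named date_time4.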

Here is my code:

import csv
import datetime

my_time = '2020-09-06 16:00:45'
my_time1 = datetime.datetime.strptime(my_time, '%Y-%m-%d %H:%M:%S')

my_countrycodes = ['555', '256', '1000']

all_row_times = [] #<--- this is the list where we will append the datetime values of the file
new_list = [] #<--- this is the final list where we will append our results
    
with open(root, 'r') as out:
    reader = csv.reader(out, delimiter = '\t')
    for row in reader:  
        # print(row)
        date_time1 = row[-2] + row[-1] #<--- concatenate date + time
        date_time2 = datetime.datetime.strptime(date_time1, '%d-%m-%Y%H:%M:%S') #<--- make a datetime object of the string
        date_time3 = datetime.datetime.strftime(date_time2, '%Y-%m-%d %H:%M:%S') #<--- turn the datetime object  back to a string
        date_time4 = datetime.datetime.strptime(date_time3, '%Y-%m-%d %H:%M:%S') #<--- turn the string object  back to a datetime object
        all_row_times.append(date_time4) #<--- put all the datetime objects into a list.
        
        if any(country_code in row[1] for country_code in my_countrycodes) and date_time4 == max(dt for dt in all_row_times if dt <  my_time1): 
            new_list.append(row) #<-- append the rows with the same country_code in my_countrycodes and the latest time if that time is < my_time1
                
print(new_list)

The file looks like this: (image not preserved)

Here is the output of new_list:

[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'], 
['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'], 
['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'], 
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'], 
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]

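As you can see, the code extracts the rows with country codes 555, 256 and 1000, and it only extracts rows whose datetime is less than my_time1. So this part works perfectly. However, the 1000 rows have 2 different datetimes, and I don't understand why it doesn't take only the MAX datetime.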

This is the expected output for 1000:

['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']

Answer 1

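Actually, it does take only the MAX datetime, but in the for loop 14:51:45 comes first. Your code compares it with the values collected so far, and since no later value has appeared yet, it is treated as the maximum.

On the next iteration the other row with that country code comes up, and since its time is later than all the others seen so far, that row is appended as well. I guess this is what you are missing.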
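To make the effect concrete, here is a small sketch that replays only the two 1000 rows from the question's output. The names seen, running_hits and final_hits are made up for this illustration; seen plays the role of all_row_times.

```python
import datetime

FMT = '%d-%m-%Y %H:%M:%S'
cutoff = datetime.datetime(2020, 9, 6, 16, 0, 45)  # same value as my_time1

# The two '1000' timestamps from the question's output, in file order.
times = [
    datetime.datetime.strptime('06-09-2020 14:51:45', FMT),
    datetime.datetime.strptime('06-09-2020 15:59:45', FMT),
]

seen = []          # grows row by row, like all_row_times
running_hits = []  # rows accepted by the "max so far" test
for t in times:
    seen.append(t)
    # At this point max() only knows about the rows read so far.
    if t == max(dt for dt in seen if dt < cutoff):
        running_hits.append(t)

# Comparing against the max of the complete list accepts only the latest row.
overall_max = max(dt for dt in times if dt < cutoff)
final_hits = [t for t in times if t == overall_max]

print(len(running_hits))  # 2
print(len(final_hits))    # 1
```

Both rows pass the running test because each one is the latest value at the moment it is read; only a comparison made after the whole file has been read can tell them apart.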

You can try something like this.

import csv
import datetime

my_time = datetime.datetime.strptime('2020-09-06 16:00:45', '%Y-%m-%d %H:%M:%S')
my_countrycodes = ['555', '256', '1000']

country_code_max_date_rel = {}  # country code -> latest datetime seen for that code
matched_rows = []               # every row whose country code is in my_countrycodes
with open(root, 'r') as out:    # root is the path to the tab-separated file
    reader = csv.reader(out, delimiter='\t')
    for row in reader:
        date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
        if row[1] in my_countrycodes:  # exact match on the country code
            matched_rows.append(row)
            # Store the datetime if this code is new or the datetime is later than the stored one.
            if row[1] not in country_code_max_date_rel or country_code_max_date_rel[row[1]] < date_time:
                country_code_max_date_rel[row[1]] = date_time

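At this point you have the maximum datetime for each country code, as well as the list of matching rows. You can then filter again like this: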

new_list = []
for row in matched_rows:
    country_code = row[1]
    date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
    if date_time == country_code_max_date_rel[country_code]:
        if date_time < my_time:
            new_list.append(row)

new_list:

[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
 ['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
 ['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
 ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]

This code is not very good, but I think it will help you update yours.
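One thing to watch in the two-pass approach: the maximum is taken over all matching rows, so if the latest row of a country code is not earlier than my_time, no row for that code ends up in new_list. The sketch below is one possible variation that applies the cutoff before computing the maximum. It works on in-memory rows; the first three rows come from the question's output, while the 17:30:00 row, the helper row_time and the names cutoff, wanted_codes, candidates and latest are assumptions added for illustration.

```python
import datetime

FMT = '%d-%m-%Y%H:%M:%S'
cutoff = datetime.datetime(2020, 9, 6, 16, 0, 45)
wanted_codes = {'555', '1000'}

rows = [
    ['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'],
    # Hypothetical extra row that is later than the cutoff.
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '17:30:00'],
]


def row_time(row):
    """Parse the date and time columns (the last two fields) of a row."""
    return datetime.datetime.strptime(row[-2] + row[-1], FMT)


# Pass 1: keep rows with a wanted code AND a time before the cutoff.
candidates = [r for r in rows if r[1] in wanted_codes and row_time(r) < cutoff]

# Latest remaining time per country code.
latest = {}
for r in candidates:
    t = row_time(r)
    if r[1] not in latest or t > latest[r[1]]:
        latest[r[1]] = t

# Pass 2: keep only the rows that carry that latest time.
result = [r for r in candidates if row_time(r) == latest[r[1]]]
print(result)
```

With the cutoff applied first, the 17:30:00 row never competes for the maximum, so the 15:59:45 row is still returned for 1000.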

Answer 2

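Sorry, I'm not sure what you are trying to do here. Assuming you only want one instance of each country code in new_list, the one with the latest time before my_time1, here is an answer: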

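The logic in your code is incorrect. Right now you iterate over all the rows in the csv file and apply the same condition to each one before appending it to new_list.
In the given case, ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'] was added because condition 1 is True (1000 is in my_countrycodes) and condition 2 is also True ('06-09-2020', '15:59:45' is less than my_time1, and it is also the latest time collected so far in all_row_times).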

You can solve this in many different ways, but here are some suggestions:

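Change your solution as follows:
check whether row[1] is in str(my_countrycodes),
check whether the row's time is less than my_time1,
check whether the row's country code is already in new_list,
if it is not in new_list, add the row,
if it is in new_list, check whether the new date and time meet your condition and, if so, update the date and time columns of that row.
Alternatively, filter the file by country code, then retrieve the max from the filtered results for each country code.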

Be careful about what your key is, because you have country codes that repeat with different parameters ("NY", "BS").
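As a minimal sketch of the first option, assuming the rows are already in memory as lists of strings: a dictionary keyed by country code stores the latest time seen before the cutoff together with every row that carries that time, so the repeated 555 rows ("NY" and "BS") are both kept. The function name latest_rows_before and its parameters are hypothetical; the sample rows are copied from the question's output.

```python
import datetime

FMT = '%d-%m-%Y%H:%M:%S'


def latest_rows_before(rows, codes, cutoff):
    """Return, per country code, the rows sharing the latest time before cutoff."""
    best = {}  # country code -> (latest time, [rows with that time])
    for row in rows:
        code = row[1]
        if code not in codes:
            continue
        t = datetime.datetime.strptime(row[-2] + row[-1], FMT)
        if t >= cutoff:
            continue
        if code not in best or t > best[code][0]:
            best[code] = (t, [row])      # strictly later time: start a new group
        elif t == best[code][0]:
            best[code][1].append(row)    # same time: keep this row as well
    return [row for _, group in best.values() for row in group]


rows = [
    ['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
    ['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
    ['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'],
]
cutoff = datetime.datetime(2020, 9, 6, 16, 0, 45)
new_list = latest_rows_before(rows, {'555', '256', '1000'}, cutoff)
print(new_list)
```

Storing a list of rows per key, rather than a single row, is what avoids losing one of the two 555 rows.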

Suggestions and comments:

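For fast access to your data, you can use a dictionary. Using the country code as the key gives you easy access to the data and lets you quickly check whether an entry exists and update its parameters.
any(country_code in row[1] for country_code in my_countrycodes)
can be written as:
row[1] in str(my_countrycodes)
(this is a substring test on the list's string form), or you can even create my_country_code_str = str(my_countrycodes) before entering the for loop.
I don't know why you convert the datetime back and forth, but a single conversion is all you need:
rows_date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
Remember that you can parse with the format '%d-%m-%Y%H:%M:%S' directly.
Remember to give your variables meaningful names and to keep one coding standard throughout your code (for example, if you use underscores, use them consistently).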
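Two of these points can be checked quickly in isolation. The sketch below is illustrative only and its variable names are made up. It shows that testing against str(my_countrycodes) is a substring test, so a partial code such as '55' also matches, whereas testing against the list or a set requires an exact match; and that one strptime call gives the same datetime as the parse, format, parse round trip.

```python
import datetime

my_countrycodes = ['555', '256', '1000']

# str(my_countrycodes) is the text "['555', '256', '1000']",
# so `in` looks for a substring rather than a list element.
substring_hit = '55' in str(my_countrycodes)
exact_hit = '55' in my_countrycodes
code_set = set(my_countrycodes)  # a set gives fast exact-match lookups

# A single parse of the concatenated date and time columns...
raw = '06-09-2020' + '15:59:45'
once = datetime.datetime.strptime(raw, '%d-%m-%Y%H:%M:%S')

# ...equals the result of formatting it and parsing it again.
round_trip = datetime.datetime.strptime(
    once.strftime('%Y-%m-%d %H:%M:%S'), '%Y-%m-%d %H:%M:%S')

print(substring_hit, exact_hit)  # True False
print(once == round_trip)        # True
```

If partial matches are not wanted, row[1] in code_set is likely the safer test.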

How to filter a file based on datetime?
