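I have a .txt file filled with data that I want to filter (about 5,800 lines). Some lines are duplicates whose only difference is a timestamp exactly 2 hours later. The later version of each duplicate should be omitted (for example, the first line in the sample below). All other lines should be kept and written to a new .txt file.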
1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check # should go
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check # should stay
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check # should stay
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check # should go
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check # should stay
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check # should go
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check # should stay
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check # should stay
I wrote some code in Python, but the output is a .txt file containing just one line of text:
deleted
I can't seem to solve this. Can you help me? See the code below.
import os

def filter_file():
    with open("output.txt", "w") as output:
        #open the input file from a specified directory
        directory = os.path.normpath("C:/Users/sande_000/Documents/Python files")
        for subdir, dirs, files in os.walk(directory):
            for file in files:
                if file.startswith("input"):
                    input_file=open(os.path.join(subdir, file))
                    #iterate over each line of the file
                    for line in input_file:
                        machine = line[0:7] #stores machine number
                        date = line[8:18] #stores date stamp
                        time_1 = int(line[19:21]) #stores hour stamp
                        time_2 = int(line[22:24]) #stores minutes stamp
                        time_3 = int(line[25:27]) #stores second stamp
                        #check current line with other lines for duplicates by iterating over each line of the file
                        for otherline in input_file:
                            compare_machine = otherline[0:7]
                            compare_date = otherline[8:18]
                            compare_time_1 = int(otherline[19:21])+2
                            compare_time_2 = int(otherline[22:24])
                            compare_time_3 = int(otherline[25:27])
                            #check whether machine number & date/hour+2/minutes/seconds stamp are similar.
                            #If yes, write 'deleted' to output.txt and stop comparing lines.
                            #If no, continue with comparing next line.
                            if compare_machine == machine and compare_date == date and compare_time_1 == time_1 and compare_time_2 == time_2 and compare_time_3 == time_3:
                                output.write("deleted"+"\n")
                                break
                            else:
                                continue
                            #If no overlap between one line with any other line from the file, write that line to output.txt since it is no duplicate.
                            output.write(line)
                    input_file.close()

if __name__ == "__main__":
    filter_file()
Answer 0 (score: 1)
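I believe the code below works. Note that it will not work if the records vary in their sub-second components (milliseconds, microseconds, nanoseconds): the format string parses whole seconds only, and datetime does not support resolution finer than microseconds. For your example this makes no difference.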
import os
from datetime import datetime, timedelta

# Raw string so the backslash is not treated as an escape character.
INPUT_DIR = r'C:\Temp'
OUTPUT_FILE = 'output.txt'


def parse_data(data):
    """Yield (line, timestamp) for every non-empty line."""
    for line in data.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. between concatenated files
        date_s = ' '.join(line.split()[1:3])
        date = datetime.strptime(date_s, '%Y-%m-%d %H:%M:%S')
        yield line, date


def filter_duplicates(data):
    """Yield lines that have no counterpart exactly two hours earlier."""
    duplicate_offset = timedelta(hours=2)
    parsed_data = list(parse_data(data))
    # A set gives constant-time lookups; only timestamps are compared.
    dates = {date for _, date in parsed_data}
    for line, date in parsed_data:
        if (date - duplicate_offset) not in dates:
            yield line


def get_input_data_from_dir(directory):
    """Concatenate the contents of all files whose name starts with 'input'."""
    data = ''
    for sub_dir, _, files in os.walk(directory):
        for file in files:
            if file.startswith('input'):
                with open(os.path.join(sub_dir, file)) as f:
                    data += f.read() + '\n'
    return data


if __name__ == '__main__':
    data = get_input_data_from_dir(INPUT_DIR)
    with open(OUTPUT_FILE, 'w') as f_out:
        content = '\n'.join(filter_duplicates(data))
        f_out.write(content)
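To illustrate the resolution caveat mentioned above, here is a minimal sketch. The `try_parse` helper and the fractional timestamps are hypothetical and do not come from the question's data. It suggests that the seconds-only format rejects fractional seconds, and that `%f` accepts at most six fractional digits, so nanosecond stamps would need to be truncated first.

```python
from datetime import datetime

SECONDS_FORMAT = '%Y-%m-%d %H:%M:%S'     # format used in the answer's code
MICROS_FORMAT = '%Y-%m-%d %H:%M:%S.%f'   # hypothetical variant with a fraction


def try_parse(text, fmt):
    """Return the parsed datetime, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


# Whole seconds parse fine with the seconds-only format.
whole = try_parse('2016-07-19 16:11:56', SECONDS_FORMAT)

# A fractional part is rejected by the seconds-only format ...
fraction_rejected = try_parse('2016-07-19 16:11:56.250', SECONDS_FORMAT)

# ... but accepted by %f, which reads one to six digits.
micros = try_parse('2016-07-19 16:11:56.250', MICROS_FORMAT)

# Nine digits (nanoseconds) are more than %f can consume.
nanos = try_parse('2016-07-19 16:11:56.123456789', MICROS_FORMAT)

print(whole, fraction_rejected, micros, nanos)
```

If the real log ever carried fractional seconds, both the format string and the exact two-hour comparison would have to be revisited.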
Tested with an input directory of this structure:
me@my-computer /cygdrive/c/Temp
$ tree
.
├── input_1.txt
└── input_2.txt
input_1.txt:
1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check
input_2.txt:
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check
output.txt after execution:
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check
Expected output from the question, copied below for convenience:
1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check # should go
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check # should stay
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check # should stay
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check # should go
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check # should stay
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check # should go
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check # should stay
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check # should stay
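Note that the filter in this answer compares timestamps only. If two unrelated entries could fall exactly two hours apart, a variant keyed on the machine ID and the rest of the line may be safer. The following is a minimal sketch under that assumption. `parse_key` and `drop_late_duplicates` are hypothetical names, and the sample is the data from the question without the trailing annotations.

```python
from datetime import datetime, timedelta

SAMPLE = """\
1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check
"""


def parse_key(line):
    """Split a line into (machine, timestamp, remaining fields)."""
    fields = line.split()
    stamp = datetime.strptime(' '.join(fields[1:3]), '%Y-%m-%d %H:%M:%S')
    return fields[0], stamp, ' '.join(fields[3:])


def drop_late_duplicates(lines, offset=timedelta(hours=2)):
    """Keep lines that have no otherwise identical entry `offset` earlier."""
    lines = [line for line in lines if line.strip()]
    seen = {parse_key(line) for line in lines}
    kept = []
    for line in lines:
        machine, stamp, rest = parse_key(line)
        if (machine, stamp - offset, rest) not in seen:
            kept.append(line)
    return kept


kept = drop_late_duplicates(SAMPLE.splitlines())
print('\n'.join(kept))
```

On the sample this keeps the same five lines as the output.txt shown above.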
Answer 1 (score: 0)
I think this shorter code should do it. It uses two consecutive loops instead of nested loops, which should improve performance.
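import os
from datetime import datetime, timedelta

# os.walk etc. (`subdir`, `files` and `output` come from the code in the question)
for file in files:
    if not file.startswith("input"):
        continue
    entries = set()
    with open(os.path.join(subdir, file)) as input_file:
        # build up entries
        for line in input_file:
            machine = line[0:7]  # stores machine number
            date = datetime.strptime(line[8:27], '%Y-%m-%d %H:%M:%S')
            entries.add((machine, date))
        # rewind, otherwise the second loop would find the file exhausted
        input_file.seek(0)
        # check entries
        for line in input_file:
            machine = line[0:7]  # stores machine number
            # a line is a late duplicate if the same machine logged 2 hours earlier
            date = datetime.strptime(line[8:27], '%Y-%m-%d %H:%M:%S') - timedelta(hours=2)
            if (machine, date) in entries:
                output.write("deleted\n")
            else:
                output.write(line)
    output.flush()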
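One thing to keep in mind with two consecutive loops over the same file object: once the first loop has read to the end, a second loop yields nothing unless the file is rewound with `seek(0)`. The same effect applies when an inner loop iterates over the file that the outer loop is already reading. A minimal sketch, using `io.StringIO` as a stand-in for an opened input file (an assumption made only to keep the demo self-contained):

```python
import io

# io.StringIO behaves like a text file opened for reading.
input_file = io.StringIO("first\nsecond\nthird\n")

# The first pass consumes every line.
first_pass = [line.strip() for line in input_file]

# A second pass without rewinding finds nothing left to read.
second_pass_without_rewind = [line.strip() for line in input_file]

# After seek(0) the file can be read from the start again.
input_file.seek(0)
second_pass_after_rewind = [line.strip() for line in input_file]

print(first_pass)
print(second_pass_without_rewind)
print(second_pass_after_rewind)
```

Reading all lines into a list once and looping over that list twice would avoid the rewind altogether, at the cost of holding the file in memory.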