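My CSV file has the following data format:
date,time,event,user,net
I need to iterate over every row of this file. If event == start, I keep going until I reach the event == End row for the same user and net, and then compute the time difference between the two events.
I have this code: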
import csv
import datetime
import time
with open('dates.csv', 'rb') as csv_file:
    csv_read = csv.reader(csv_file)
    for row in csv_read:
        if row[2]=="start":
            n1=datetime.datetime.strptime(row[1], '%H:%M:%S')
            for row2 in csv_read:
                if (row2[2]=="End" and row[3]==row2[3] and row[4]==row2[4]):
                    n2=datetime.datetime.strptime(row2[1], '%H:%M:%S')
                    print row[2],row[1], row2[2], row2[1]
                    diff = n2 - n1
                    print "time difference = ", diff.seconds
                    break
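The problem with this code is that once it finds the matching "End" and computes the time, it resumes searching for the next "start" only after that matched "End" row, ignoring the rows before it. As an example, this input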
May,20,9:02:22,2010,start,user1,net-3
May,20,9:02:23,2010,start,user1,net-3
May,20,9:02:55,2010,start,user1,net-2
May,20,9:02:55,2010,End,user1,net-3
May,20,9:03:43,2010,End,user1,net-2
May,20,9:02:55,2010,End,user1,net-3
May,20,9:03:43,2010,End,user1,net-2
May,20,9:03:44,2010,start,user1,net-2
May,20,9:03:49,2010,End,user1,net-2
produces only the following output:
Connect 9:02:22 Disconnect 9:02:55
time difference = 33
Connect 9:03:44 Disconnect 9:03:49
time difference = 5
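Does anyone know how to solve this? Also, is it possible to add the time difference as an extra column to the existing data?
Thanks
I have updated the code, but now I have a new problem: my file contains 35734 lines, yet the output file contains only 350 lines. I am confused about why this happens. Any help would be appreciated. Updated code: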
l1=[] ## empty list
l2=[] ## empty list
csv_file=open('dates_read.csv', 'r+')
csv_wfile=open('dates_write.csv', 'w+')
csv_read = csv.reader(csv_file)
csv_read1 = csv_read
csv_write = csv.writer(csv_wfile)
for row in csv_read:
    s=csv_read.line_num
    if (row[4]=="start" and (s not in l1)):
        n1=datetime.datetime.strptime(row[2], '%H:%M:%S')
        l1.append(s)
        month = str(row[0])
        day = int(row[1])
        time = str(row[2])
        year = int(row[3])
        user = str(row[5])
        net = str(row[6])
        dwell_time = str(row[7])
        for row2 in csv_read1:
            e=csv_read1.line_num
            if (row2[4]=="End" and row[5]==row2[5] and row[6]==row2[6] and (csv_read1.line_num not in l2)and s<e):
                n2=datetime.datetime.strptime(row2[2], '%H:%M:%S')
                diff = n2 - n1
                dwell_time= diff
                print("time difference = ", diff.seconds,"\n")
                csv_write.writerow([month, day, time, year, user, net, dwell_time])
                l2.append(e)
                break
print (s) #prints 818
print (e) #prints 35734
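Answer 0 (score: 4)
The only problem with your code is that, after encountering the first START, you search for the END keyword by continuing through the same reader, so every row the inner loop passes over is lost to the outer loop. Instead, the search for END should traverse the file's rows from the beginning. With that change, we also have to make sure the same END row is not matched a second time. For this we can use a list that stores the line numbers of the rows that have already been matched.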
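To illustrate the point about the shared reader, here is a minimal sketch. The three-column sample (event,user,net) and the helper name `outer_rows_seen` are made up for the demonstration and are not taken from the question's file. It runs the same nested-loop pattern over a single `csv.reader` and records which rows the outer loop actually receives.

```python
import csv
import io

# Hypothetical sample (event,user,net); not the asker's real data.
SAMPLE = (
    "start,user1,net-3\n"
    "start,user1,net-2\n"
    "End,user1,net-3\n"
    "End,user1,net-2\n"
)

def outer_rows_seen(text):
    """Return the rows that the outer loop gets to see."""
    reader = csv.reader(io.StringIO(text))
    seen = []
    for row in reader:
        seen.append(row)
        if row[0] == "start":
            # Same reader object: rows consumed here never come back.
            for row2 in reader:
                if row2[0] == "End" and row2[1:] == row[1:]:
                    break
    return seen

seen = outer_rows_seen(SAMPLE)
print(seen)
```

The second start row is swallowed by the inner loop, so the outer loop never processes it. Assigning the reader to a second name, as in `csv_read1 = csv_read`, does not change this, because both names refer to the same iterator.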
I did not write new code for the fix; I only modified yours:
>>> l=[] ## empty list
>>> csv_file=open('dates.csv')
>>> csv_read = csv.reader(csv_file)
>>> for row in csv_read:
        if row[0].split()[4]=="start":
            n1=datetime.datetime.strptime(row[0].split()[2], '%H:%M:%S')
            s=csv_read.line_num
            ## Re-open the file so the search for End starts from the first line
            csv_file1=open('/Python34/Workspace/Stoverflow/dates.csv')
            csv_read1 = csv.reader(csv_file1)
            for row2 in csv_read1:
                e=csv_read1.line_num
                ## Inside the if I am adding two more checks: the same End line is not matched again, and the End flag always comes after the start flag
                if (row2[0].split()[4]=="End" and row[0].split()[6]==row2[0].split()[6] and row[0].split()[5]==row2[0].split()[5] and (csv_read1.line_num not in l) and s<csv_read1.line_num):
                    n2=datetime.datetime.strptime(row2[0].split()[2], '%H:%M:%S')
                    print("Connect : ",row[0].split()[2]," Disconnect :",row2[0].split()[2])
                    diff = n2 - n1
                    print("time difference = ", diff.seconds,"\n")
                    l.append(csv_read1.line_num)
                    del csv_read1
                    break
            del csv_file1
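The same idea can also be sketched without re-opening the file for every start row, by reading all rows into a list once. This is only an illustration under assumptions: the sample below follows the comma-separated layout of the question's example (month,day,time,year,event,user,net), the function name `add_dwell_times` and the column constants are hypothetical, and a set stands in for the list of used line numbers. It also appends the difference as an extra column, which the question asked about.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample modelled on the question's example rows.
SAMPLE = """May,20,9:02:22,2010,start,user1,net-3
May,20,9:02:23,2010,start,user1,net-3
May,20,9:02:55,2010,start,user1,net-2
May,20,9:02:55,2010,End,user1,net-3
May,20,9:03:43,2010,End,user1,net-2
May,20,9:03:44,2010,start,user1,net-2
May,20,9:03:49,2010,End,user1,net-2
"""

TIME, EVENT, USER, NET = 2, 4, 5, 6  # column positions assumed from the sample

def add_dwell_times(rows):
    """Pair each start with the first later, unused End for the same user/net.

    Returns copies of the rows with one extra column: the difference in
    seconds for matched start rows, an empty string otherwise.
    """
    used_ends = set()  # indices of End rows that are already paired
    out = []
    for i, row in enumerate(rows):
        dwell = ""
        if row[EVENT] == "start":
            t1 = datetime.strptime(row[TIME], "%H:%M:%S")
            for j in range(i + 1, len(rows)):
                row2 = rows[j]
                if (j not in used_ends and row2[EVENT] == "End"
                        and row2[USER] == row[USER] and row2[NET] == row[NET]):
                    t2 = datetime.strptime(row2[TIME], "%H:%M:%S")
                    dwell = (t2 - t1).seconds
                    used_ends.add(j)
                    break
        out.append(row + [dwell])
    return out

rows = list(csv.reader(io.StringIO(SAMPLE)))
result = add_dwell_times(rows)
buffer = io.StringIO()
csv.writer(buffer).writerows(result)
print(buffer.getvalue())
```

Every input row appears in the output, so the row count is preserved. Because only the time of day is parsed, a session that crosses midnight would need the date columns as well.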
Answer 1 (score: 1)
I think this problem is better solved with a map.
Define (user_id, net_id) as the key and (start_status, start_time) as the value, like this:
class UserNet:
    user_id = -1
    net_id = -1
    # other operations

class StartStatus:
    start_flag = False
    start_time = -1
    # other operations
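When you read a line, first check whether the status string in that line is START or END.
If it is END, use the (user_id, net_id) read from that line to look up the entry in the map, take its start_time, and subtract it from the END time to get the answer.
If it is START, insert the value into the map.
If you do not need error checking, start_flag is unnecessary: it only marks a repeated START, and you may not need it.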
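The steps above could be sketched in Python roughly as follows, using a plain dict keyed by a (user, net) tuple instead of the two classes. The sample data, the function name `dwell_times`, and the choice to ignore a repeated start while a session is open are assumptions made for this sketch, not part of the answer.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample modelled on the question's example rows.
SAMPLE = """May,20,9:02:22,2010,start,user1,net-3
May,20,9:02:23,2010,start,user1,net-3
May,20,9:02:55,2010,start,user1,net-2
May,20,9:02:55,2010,End,user1,net-3
May,20,9:03:43,2010,End,user1,net-2
May,20,9:03:44,2010,start,user1,net-2
May,20,9:03:49,2010,End,user1,net-2
"""

def dwell_times(rows):
    """Single pass: remember the open start per (user, net), close it on End."""
    open_starts = {}  # (user, net) -> start time of the open session
    results = []      # (user, net, seconds)
    for month, day, clock, year, event, user, net in rows:
        key = (user, net)
        stamp = datetime.strptime(clock, "%H:%M:%S")
        if event == "start":
            # Assumption: a repeated start for an open key is ignored, so the
            # earliest start wins. This is the case start_flag would report.
            open_starts.setdefault(key, stamp)
        elif event == "End" and key in open_starts:
            started = open_starts.pop(key)
            results.append((user, net, (stamp - started).seconds))
    return results

results = dwell_times(csv.reader(io.StringIO(SAMPLE)))
for user, net, seconds in results:
    print(user, net, "time difference =", seconds)
```

This reads the file once, so the cost grows linearly with the number of rows rather than with its square as in the nested-loop versions. An End row with no open start is silently skipped here; whether that should be reported is a design choice left open.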