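I need to identify the calls that are associated with ticket searches and add a value in a new column to track the correlation. The data is sorted chronologically.
My data looks like this: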
TIME,INDEX,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:55,1850,START,111,xxxx726,USER_2,,
3/10/2019 14:55,1846,STOP,333,xxxx744,USER_5,,
3/10/2019 14:56,1849,START,333,xxxx744,USER_5,,
3/10/2019 14:57,1855,START,333,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 14:59,1852,START,333,xxxx726,USER_2,,
3/10/2019 15:00,1847,STOP,333,xxxx744,USER_5,,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:03,1849,STOP,333,xxxx744,USER_5,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:05,1854,START,333,xxxx619,USER_3,,
3/10/2019 15:05,1850,STOP,111,xxxx726,USER_2,,
3/10/2019 15:07,1851,STOP,333,xxxx619,USER_3,,
3/10/2019 15:08,1852,STOP,333,xxxx726,USER_2,,
3/10/2019 15:09,1856,START,333,xxxx732,USER_1,,
3/10/2019 15:09,1858,START,333,xxxx619,USER_3,,
3/10/2019 15:09,1860,START,222,xxxx726,USER_2,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
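The INDEX column contains the unique ID of a given call.
The TYPE column contains START and STOP values for calls, plus a SEARCH value that indicates a ticket search.
The key for the correlation is LOGIN, which tracks the user ID.
After a START, I need to find a related SEARCH before reaching the matching STOP. If a set follows the pattern START, SEARCH (possibly several searches), STOP, I need to flag that set as connected, and possibly number the sets 1, 2, 3, and so on.
Example set: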
TIME:3/10/2019 14:53 INDEX:1853 TYPE:START LOGIN:xxxx732
TIME:3/10/2019 14:57 INDEX:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:04 INDEX:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:11 INDEX:1853 TYPE:STOP LOGIN:xxxx732
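My data is in a CSV file named Combined.csv.
I have been able to load the data and isolate specific rows based on multiple conditions, or assign True/False when a row meets a condition,
but I can't figure out how to trigger an iteration on a set of conditions such as TYPE, USER, and INDEX.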
import pandas as pd
# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
df = pd.read_csv("combined.csv")
# df['TEST'] = df['INDEX'].apply(lambda x: 'True' if x == 1 else 'False')
# print(df)
# test = df[(df.TYPE == "START") | (df.INDEX == 1)]
# print(test)
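Update
Does it make sense to delete this post at this point, or to submit the update as an answer?
I was able to get the CSV into pandas. See below for the current state.
My data looks like this: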
TIME,ID,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:55,1850,START,111,xxxx726,USER_2,,
3/10/2019 14:56,1849,START,333,xxxx744,USER_5,,
3/10/2019 14:57,1855,START,333,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 14:58,0,SEARCH,,xxxx732,USER_1,,xxxxx21
3/10/2019 14:59,1852,START,333,xxxx726,USER_2,,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:03,1849,STOP,333,xxxx744,USER_5,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:05,1854,START,333,xxxx619,USER_3,,
3/10/2019 15:05,1850,STOP,111,xxxx726,USER_2,,
3/10/2019 15:08,1852,STOP,333,xxxx726,USER_2,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
3/10/2019 15:12,1855,STOP,333,xxxx738,USER_4,,
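The data is sorted chronologically. The ID column contains the unique ID of a given call. The TYPE column contains START and STOP values for calls, plus a SEARCH value that indicates a ticket search. The key for the correlation is LOGIN, which tracks the user ID.
After a START, I need to find a related SEARCH before reaching the matching STOP. If a set follows the pattern START, SEARCH (possibly several searches), STOP, I need to flag that set as connected, and possibly number the sets 1, 2, 3, and so on.
Example set: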
TIME:3/10/2019 14:53 ID:1853 TYPE:START LOGIN:xxxx732
TIME:3/10/2019 14:57 ID:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 14:58 ID:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:04 ID:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:11 ID:1853 TYPE:STOP LOGIN:xxxx732
This is the code I have pieced together so far. It finds a SEARCH by login between the START and STOP rows that share the same ID.
import csv

call = 1853

# Note: DISPSPLIT and ANSLOGIN appear to be the real file's names for the
# SPLIT and LOGIN columns shown in the anonymized sample above.

def FINDSTART(call):
    # Return (line number, time, login) of the START row for this call ID.
    with open('combined_3.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            time = str(row['TIME'])
            id = int(row['ID'])
            type = str(row['TYPE'])
            skill = str(row['DISPSPLIT'])
            login = str(row['ANSLOGIN'])
            if id == call and type == "START":
                arow = int(reader.line_num)
                print(arow,time,id,skill,login)
                return (reader.line_num), (time), (login)

def FINDSTOP(call):
    # Return (line number, time, login) of the STOP row for this call ID.
    with open('combined_3.csv') as f2:
        reader = csv.DictReader(f2)
        for row in reader:
            time = str(row['TIME'])
            id = int(row['ID'])
            type = str(row['TYPE'])
            skill = str(row['DISPSPLIT'])
            login = str(row['ANSLOGIN'])
            if id == call and type == "STOP":
                brow = int(reader.line_num)
                print(brow,time,id,skill,login)
                return (reader.line_num), (time), (login)

def FINDSEARCH(aline,bline,aL):
    # Return the first SEARCH row by login aL that lies strictly between
    # line aline (START) and line bline (STOP).
    with open('combined_3.csv') as f3:
        reader = csv.DictReader(f3)
        for row in reader:
            time = str(row['TIME'])
            type = str(row['TYPE'])
            login = str(row['ANSLOGIN'])
            ticket = str(row['TICKETUD'])
            account = str(row['ACCOUNTID'])
            arow = int(aline)
            brow = int(bline)
            crow = int(reader.line_num)
            if type == "SEARCH" and aL == login and arow < crow < brow:
                print(reader.line_num,time,login,ticket,account)
                return (reader.line_num), (time), (login), (ticket), (account)

aLine, aT, aL = FINDSTART(call)
bLine, bT, bL = FINDSTOP(call)
cline, time, login, ticket, account = FINDSEARCH(aLine,bLine,aL)
print("Search" + ", " + time + ", " + login + ", " + ticket + ", " + account)
This is the output of the code.
testfuct.py
37928 3/10/2019 14:53 1853 708 1671732
37932 3/10/2019 15:11 1853 708 1671732
37929 3/10/2019 14:57 1671732 60954939
Search, 3/10/2019 14:57, 1671732, 60954939,
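Outstanding goals: Calculate the difference between aLine and bLine. Use that count to iterate the search so that it finds matches in all rows between aLine and bLine. Determine whether there is a better way to read the data file than opening it in every function.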
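One possible way to approach these goals is to read the file once into a list and slice it between the START and STOP positions. The following is only a minimal sketch, not a tested solution for the real file: the helper names load_rows and find_searches are hypothetical, and the inline sample uses the ID, TYPE and LOGIN headers from the schema shown above, which may differ from the headers in combined_3.csv.

```python
import csv
import io

# Hypothetical sample built from the rows shown above; the real file
# may use different column headers (e.g. ANSLOGIN instead of LOGIN).
SAMPLE = """TIME,ID,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 14:58,0,SEARCH,,xxxx732,USER_1,,xxxxx21
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
"""


def load_rows(handle):
    """Read the whole file once and return its rows as a list of dicts."""
    return list(csv.DictReader(handle))


def find_searches(rows, call):
    """Return every SEARCH row between the START and STOP of `call`
    that has the same LOGIN as the START row.
    Returns an empty list if the call has no START/STOP pair."""
    start = stop = None
    for pos, row in enumerate(rows):
        if int(row["ID"]) != call:
            continue
        if row["TYPE"] == "START" and start is None:
            start = pos
        elif row["TYPE"] == "STOP" and start is not None:
            stop = pos
            break
    if start is None or stop is None:
        return []
    login = rows[start]["LOGIN"]
    # Slice only the rows strictly between START and STOP.
    return [r for r in rows[start + 1:stop]
            if r["TYPE"] == "SEARCH" and r["LOGIN"] == login]


rows = load_rows(io.StringIO(SAMPLE))
searches = find_searches(rows, 1853)
for r in searches:
    print("Search", r["TIME"], r["LOGIN"], r["TICKETUD"], r["ACCOUNTID"], sep=", ")
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by an open file handle, so the file is opened only once and the same rows list can be reused for every call ID.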
Answer 0: (score: 0)
To count the number of calls, you can take the length of a groupby on the INDEX column:
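mask_search = df['TYPE']=='SEARCH'
df_no_search = df.drop(df[mask_search].index)
# Number of calls (SEARCH rows, which all have INDEX 0, are already dropped)
print(len(df_no_search.groupby(['INDEX']).size()))
# To count the number of unique calls each day.
# The sample has no DATE column, so the date is derived from TIME here.
dates = pd.to_datetime(df_no_search['TIME']).dt.date
df_no_search.groupby(dates)['INDEX'].nunique()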
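The count above includes calls that have only a START or only a STOP in the file. If only calls with both rows should be counted, a cross-tabulation of INDEX against TYPE is one way to check. This is a small sketch on a hypothetical inline sample with fewer columns than the real data; the variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical reduced sample; the real data has more columns.
SAMPLE = """TIME,INDEX,TYPE,LOGIN
3/10/2019 14:53,1853,START,xxxx732
3/10/2019 14:54,1848,START,xxxx738
3/10/2019 14:57,0,SEARCH,xxxx732
3/10/2019 15:00,1848,STOP,xxxx738
3/10/2019 15:05,1854,START,xxxx619
3/10/2019 15:11,1853,STOP,xxxx732
"""

df = pd.read_csv(io.StringIO(SAMPLE))
calls = df[df["TYPE"] != "SEARCH"]

# Rows are call IDs, columns are START/STOP counts.
# reindex guarantees both columns exist even if one type is absent.
counts = pd.crosstab(calls["INDEX"], calls["TYPE"]).reindex(
    columns=["START", "STOP"], fill_value=0
)

# Calls that have at least one START and at least one STOP.
complete = counts[(counts["START"] > 0) & (counts["STOP"] > 0)].index.tolist()

print("all calls:", calls["INDEX"].nunique())
print("complete calls:", complete)
```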
To get the details of each call, you can group by several columns:
df.groupby(['LOGIN', 'TIME','INDEX', 'TYPE' ]).size()
Output:
LOGIN TIME INDEX TYPE
xxxx619 3/10/2019 15:05 1854 START 1
3/10/2019 15:07 1851 STOP 1
3/10/2019 15:09 1858 START 1
xxxx726 3/10/2019 14:55 1850 START 1
3/10/2019 14:59 1852 START 1
3/10/2019 15:05 1850 STOP 1
3/10/2019 15:08 1852 STOP 1
3/10/2019 15:09 1860 START 1
xxxx732 3/10/2019 14:53 1853 START 1
3/10/2019 14:57 0 SEARCH 1
3/10/2019 15:04 0 SEARCH 1
3/10/2019 15:09 1856 START 1
3/10/2019 15:11 1853 STOP 1
xxxx738 3/10/2019 14:54 1848 START 1
3/10/2019 14:57 1855 START 1
3/10/2019 15:00 1848 STOP 1
xxxx744 3/10/2019 14:55 1846 STOP 1
3/10/2019 14:56 1849 START 1
3/10/2019 15:00 1847 STOP 1
3/10/2019 15:03 1849 STOP 1
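The groupby output shows each login's events in order but does not add the new column the question asks for. One way to flag the START, SEARCH, STOP sets could be a single chronological pass that tracks the open calls per LOGIN. This is a rough sketch under several assumptions: the function name flag_connected and the column name CONNECTED are made up for illustration, the rows are already sorted by time, and a SEARCH is attributed to every call its login has open at that moment (if a login has overlapping calls, a SEARCH row keeps the number of the set that closes last).

```python
import io
import pandas as pd

# Hypothetical sample taken from the rows in the question.
SAMPLE = """TIME,INDEX,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
"""


def flag_connected(df):
    """Return a copy of df with a CONNECTED column: the set number for rows
    that belong to a START/SEARCH/STOP set, 0 for all other rows."""
    df = df.copy()
    df["CONNECTED"] = 0
    open_calls = {}  # LOGIN -> {call INDEX: labels of its START and SEARCH rows}
    set_no = 0
    for label, row in df.iterrows():
        calls = open_calls.setdefault(row["LOGIN"], {})
        if row["TYPE"] == "START":
            calls[row["INDEX"]] = [label]
        elif row["TYPE"] == "SEARCH":
            # Attach the search to every call this login currently has open.
            for labels in calls.values():
                labels.append(label)
        elif row["TYPE"] == "STOP":
            labels = calls.pop(row["INDEX"], None)
            # A set is connected only if a START was seen and at least
            # one SEARCH was collected before this STOP.
            if labels is not None and len(labels) > 1:
                set_no += 1
                df.loc[labels + [label], "CONNECTED"] = set_no
    return df


df = pd.read_csv(io.StringIO(SAMPLE))
result = flag_connected(df)
print(result[["TIME", "INDEX", "TYPE", "LOGIN", "CONNECTED"]])
```

In this sample, call 1853 and its two searches get set number 1, while call 1848 stays at 0 because its login made no search between START and STOP. iterrows is slow on large frames, so this is meant to show the logic rather than to be the fastest approach.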