Python method to find rows based on criteria between rows with start or stop conditions

Time: 2019-06-06 04:02:07

Tags: python-3.x pandas

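I need to identify the calls that are associated with ticket searches and add a value in a new column to track the correlation. The data is sorted chronologically.

My data looks like this: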

TIME,INDEX,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:55,1850,START,111,xxxx726,USER_2,,
3/10/2019 14:55,1846,STOP,333,xxxx744,USER_5,,
3/10/2019 14:56,1849,START,333,xxxx744,USER_5,,
3/10/2019 14:57,1855,START,333,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 14:59,1852,START,333,xxxx726,USER_2,,
3/10/2019 15:00,1847,STOP,333,xxxx744,USER_5,,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:03,1849,STOP,333,xxxx744,USER_5,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:05,1854,START,333,xxxx619,USER_3,,
3/10/2019 15:05,1850,STOP,111,xxxx726,USER_2,,
3/10/2019 15:07,1851,STOP,333,xxxx619,USER_3,,
3/10/2019 15:08,1852,STOP,333,xxxx726,USER_2,,
3/10/2019 15:09,1856,START,333,xxxx732,USER_1,,
3/10/2019 15:09,1858,START,333,xxxx619,USER_3,,
3/10/2019 15:09,1860,START,222,xxxx726,USER_2,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,

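The INDEX column contains the unique ID of a given call.
The TYPE column contains START and STOP values for calls, plus a SEARCH value that indicates a ticket search. The key for the correlation is LOGIN, which tracks the user ID.

On a START, I need to find an associated SEARCH before the STOP is reached. If a set matches the pattern START, SEARCH (possibly several searches), STOP, the set needs to be flagged as connected, and possibly numbered so that the sets count up as 1, 2, 3, and so on.

Example set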

TIME:3/10/2019 14:53 INDEX:1853 TYPE:START LOGIN:xxxx732
TIME:3/10/2019 14:57 INDEX:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:04 INDEX:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:11 INDEX:1853 TYPE:STOP LOGIN:xxxx732
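
The START, SEARCH, STOP pattern described above can be sketched in pandas roughly as follows. This is only one possible approach, not a confirmed solution: the function name `label_connected_sets` and the new `SET` column are hypothetical, and the inline sample is a subset of the rows shown above. It assumes the rows are already in time order and that each call ID has at most one START and one STOP. If two calls from the same login overlap, a SEARCH that falls inside both ends up with the later set number.

```python
import io
import pandas as pd

# Subset of the question's rows, inlined so the sketch runs on its own.
SAMPLE = """TIME,INDEX,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:09,1856,START,333,xxxx732,USER_1,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
"""


def label_connected_sets(df):
    """Return a copy of df with a SET column (0 = not part of a connected set)."""
    out = df.reset_index(drop=True).copy()
    out["SET"] = 0
    set_no = 0
    for start_pos, start in out[out["TYPE"] == "START"].iterrows():
        # First STOP with the same call ID after this START.
        stops = out[(out["TYPE"] == "STOP")
                    & (out["INDEX"] == start["INDEX"])
                    & (out.index > start_pos)]
        if stops.empty:
            continue  # the call is still open at the end of the data
        stop_pos = stops.index[0]
        between = out.loc[start_pos:stop_pos]
        searches = between[(between["TYPE"] == "SEARCH")
                           & (between["LOGIN"] == start["LOGIN"])]
        if searches.empty:
            continue  # no search by this login during the call
        set_no += 1
        out.loc[[start_pos, stop_pos, *searches.index], "SET"] = set_no
    return out


df = pd.read_csv(io.StringIO(SAMPLE))
labelled = label_connected_sets(df)
print(labelled[["TIME", "INDEX", "TYPE", "LOGIN", "SET"]])
```

Rows that keep `SET` equal to 0 are calls without a matching search, or searches that happened outside any call from the same login.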

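My data is in a CSV file named Combined.csv.

I have been able to load the data and isolate specific rows based on multiple conditions, or assign true/false when a row meets a condition,

but I can't figure out how to trigger an iteration on a set of conditions such as TYPE, USER and INDEX.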

import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv("combined.csv")

# df['TEST'] = df['INDEX'].apply(lambda x: 'True' if x == 1 else 'False')
# print(df)


# test = df[(df.TYPE == "START") | (df.INDEX == 1)]
# print(test)
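
As a rough illustration of the kind of iteration asked about here, the following sketch builds a boolean column with a plain comparison instead of `apply`, and then walks the data one login at a time with `groupby`. The `IS_SEARCH` column and the `events_by_login` dictionary are hypothetical names used only for this example, and the inline sample is a subset of the rows from the question.

```python
import io
import pandas as pd

# Subset of the question's rows, inlined so the sketch runs on its own.
SAMPLE = """TIME,INDEX,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:09,1856,START,333,xxxx732,USER_1,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# A comparison is already vectorised, so it yields real booleans without apply().
df["IS_SEARCH"] = df["TYPE"] == "SEARCH"

# Visit one login at a time; rows inside each group keep their original (time) order.
events_by_login = {}
for login, group in df.groupby("LOGIN", sort=False):
    events_by_login[login] = group["TYPE"].tolist()

print(events_by_login)
```

Inside the loop, each `group` is an ordinary DataFrame, so the same boolean masks can be applied per login to look for a SEARCH between a START and a STOP.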

Update

At this point, does it make sense to delete this post, or to submit the update as an answer?

I have made some progress by working with the CSV file directly. See below for the current state.

My data looks like this:

TIME,ID,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:55,1850,START,111,xxxx726,USER_2,,
3/10/2019 14:56,1849,START,333,xxxx744,USER_5,,
3/10/2019 14:57,1855,START,333,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 14:58,0,SEARCH,,xxxx732,USER_1,,xxxxx21
3/10/2019 14:59,1852,START,333,xxxx726,USER_2,,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:03,1849,STOP,333,xxxx744,USER_5,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:05,1854,START,333,xxxx619,USER_3,,
3/10/2019 15:05,1850,STOP,111,xxxx726,USER_2,,
3/10/2019 15:08,1852,STOP,333,xxxx726,USER_2,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
3/10/2019 15:12,1855,STOP,333,xxxx738,USER_4,,

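The data is sorted chronologically. The ID column contains the unique ID of a given call. The TYPE column contains START and STOP values for calls, plus a SEARCH value that indicates a ticket search. The key for the correlation is LOGIN, which tracks the user ID.

On a START, I need to find an associated SEARCH before the STOP is reached. If a set matches the pattern START, SEARCH (possibly several searches), STOP, the set needs to be flagged as connected, and possibly numbered so that the sets count up as 1, 2, 3, and so on.

Example set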

TIME:3/10/2019 14:53 ID:1853 TYPE:START LOGIN:xxxx732
TIME:3/10/2019 14:57 ID:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 14:58 ID:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:04 ID:0 TYPE:SEARCH LOGIN:xxxx732
TIME:3/10/2019 15:11 ID:1853 TYPE:STOP LOGIN:xxxx732

Here is the code I have cobbled together so far to find a SEARCH by login between the START and STOP rows that share the same ID.

import csv

call = 1853

def FINDSTART(call):
    with open('combined_3.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            time = str(row['TIME'])
            id = int(row['ID'])
            type = str(row['TYPE'])
            skill = str(row['DISPSPLIT'])
            login = str(row['ANSLOGIN'])
            if id == call and type == "START":
                arow = int(reader.line_num)
                print(arow,time,id,skill,login)
                return (reader.line_num), (time), (login)

def FINDSTOP(call):
    with open('combined_3.csv') as f2:
        reader = csv.DictReader(f2)
        for row in reader:
            time = str(row['TIME'])
            id = int(row['ID'])
            type = str(row['TYPE'])
            skill = str(row['DISPSPLIT'])
            login = str(row['ANSLOGIN'])
            if id == call and type == "STOP":
                brow = int(reader.line_num)
                print(brow,time,id,skill,login)
                return (reader.line_num), (time), (login)

def FINDSEARCH(aline,bline,aL):
    # Returns only the first SEARCH row for login aL between lines aline and bline
    with open('combined_3.csv') as f3:
        reader = csv.DictReader(f3)
        for row in reader:
            time = str(row['TIME'])
            type = str(row['TYPE'])
            login = str(row['ANSLOGIN'])
            ticket = str(row['TICKETUD'])
            account = str(row['ACCOUNTID'])
            arow = int(aline)
            brow = int(bline)
            crow = int(reader.line_num)
            if type == "SEARCH" and aL == login and arow < crow < brow:
                print(reader.line_num,time,login,ticket,account)
                return (reader.line_num), (time), (login), (ticket), (account)


aLine, aT, aL = FINDSTART(call)

bLine, bT, bL = FINDSTOP(call)

cline, time, login, ticket, account = FINDSEARCH(aLine,bLine,aL)

print("Search" + ", " + time + ", " + login + ", " + ticket + ", " + account)

Here is the result of the code.

testfuct.py 
37928 3/10/2019 14:53 1853 708 1671732
37932 3/10/2019 15:11 1853 708 1671732
37929 3/10/2019 14:57 1671732 60954939
Search, 3/10/2019 14:57, 1671732, 60954939,

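Outstanding goals: Calculate the difference between aLine and bLine. Use that count to iterate the search so that it finds the matches in all rows between aLine and bLine. Determine whether there is a better way to read the data file than opening it in every function.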
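
One way these goals might be approached with only the `csv` module is to read the file once into a list and slice the rows between START and STOP, which also removes the need to count lines separately (the gap is simply `stop - start - 1`). The sketch below is an illustration under assumptions, not the asker's final code: `load_rows` and `find_searches` are hypothetical helpers, and the inline sample reuses rows from the update under an assumed header that carries the `DISPSPLIT` and `ANSLOGIN` column names read by the code above.

```python
import csv
import io

# Rows from the update, under an assumed header matching the names the code reads.
SAMPLE = """TIME,ID,TYPE,DISPSPLIT,ANSLOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 14:58,0,SEARCH,,xxxx732,USER_1,,xxxxx21
3/10/2019 14:59,1852,START,333,xxxx726,USER_2,,
3/10/2019 15:04,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:08,1852,STOP,333,xxxx726,USER_2,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
"""


def load_rows(f):
    """Read the whole file once into a list of dicts."""
    return list(csv.DictReader(f))


def find_searches(rows, call):
    """Return every SEARCH row between START and STOP of `call` by the same login."""
    start = stop = None
    for pos, row in enumerate(rows):
        if int(row["ID"]) != call:
            continue
        if row["TYPE"] == "START" and start is None:
            start = pos
        elif row["TYPE"] == "STOP" and start is not None:
            stop = pos
            break
    if start is None or stop is None:
        return []  # call not found, or never closed
    login = rows[start]["ANSLOGIN"]
    return [row for row in rows[start + 1:stop]
            if row["TYPE"] == "SEARCH" and row["ANSLOGIN"] == login]


# With a real file this would be: with open('combined_3.csv') as f: rows = load_rows(f)
rows = load_rows(io.StringIO(SAMPLE))
for match in find_searches(rows, 1853):
    print("Search", match["TIME"], match["ANSLOGIN"],
          match["TICKETUD"], match["ACCOUNTID"], sep=", ")
```

Returning an empty list when a call is missing or still open also avoids the unpacking error that the three separate functions raise when they fall through and return `None`.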

1 answer:

Answer 0 (score: 0)

To count the number of calls, you can take the length of a groupby on the INDEX column:

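mask_search = df['TYPE']=='SEARCH'
df_no_search = df.drop(df[mask_search].index)

# Number of calls
# Note: with the SEARCH rows (INDEX 0) already dropped, the trailing -1 looks like it undercounts by one
print(len(df_no_search.groupby(['INDEX']).size())-1)

# To count the number of unique calls each day
# (this needs a DATE column; the sample data only has TIME):
df.drop(df[mask_search].index).groupby(['DATE'])['INDEX'].nunique()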
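
The per-day count groups on a DATE column, while the sample data only has TIME. A minimal sketch of how such a column might be derived is shown below. It assumes TIME is in month/day/year order, which the question does not confirm, and it counts distinct INDEX values directly with `nunique`. The names `calls`, `total_calls` and `calls_per_day` are hypothetical, and the inline sample is a subset of the rows from the question.

```python
import io
import pandas as pd

# Subset of the question's rows, inlined so the sketch runs on its own.
SAMPLE = """TIME,INDEX,TYPE,SPLIT,LOGIN,USERNAME,TICKETUD,ACCOUNTID
3/10/2019 14:53,1853,START,111,xxxx732,USER_1,,
3/10/2019 14:54,1848,START,111,xxxx738,USER_4,,
3/10/2019 14:57,0,SEARCH,,xxxx732,USER_1,xxxxx39,
3/10/2019 15:00,1848,STOP,111,xxxx738,USER_4,,
3/10/2019 15:11,1853,STOP,111,xxxx732,USER_1,,
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Keep only call rows; SEARCH rows carry INDEX 0 and are not calls.
calls = df[df["TYPE"] != "SEARCH"].copy()

# Assumption: TIME is month/day/year. Swap %m and %d if it is day/month/year.
calls["DATE"] = pd.to_datetime(calls["TIME"], format="%m/%d/%Y %H:%M").dt.date

total_calls = calls["INDEX"].nunique()                 # distinct call IDs overall
calls_per_day = calls.groupby("DATE")["INDEX"].nunique()  # distinct call IDs per day

print(total_calls)
print(calls_per_day)
```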

To get the details of each call, you can group by several columns:

df.groupby(['LOGIN', 'TIME','INDEX', 'TYPE' ]).size()

Output:

LOGIN    TIME             INDEX  TYPE  
xxxx619  3/10/2019 15:05  1854   START     1
         3/10/2019 15:07  1851   STOP      1
         3/10/2019 15:09  1858   START     1
xxxx726  3/10/2019 14:55  1850   START     1
         3/10/2019 14:59  1852   START     1
         3/10/2019 15:05  1850   STOP      1
         3/10/2019 15:08  1852   STOP      1
         3/10/2019 15:09  1860   START     1
xxxx732  3/10/2019 14:53  1853   START     1
         3/10/2019 14:57  0      SEARCH    1
         3/10/2019 15:04  0      SEARCH    1
         3/10/2019 15:09  1856   START     1
         3/10/2019 15:11  1853   STOP      1
xxxx738  3/10/2019 14:54  1848   START     1
         3/10/2019 14:57  1855   START     1
         3/10/2019 15:00  1848   STOP      1
xxxx744  3/10/2019 14:55  1846   STOP      1
         3/10/2019 14:56  1849   START     1
         3/10/2019 15:00  1847   STOP      1
         3/10/2019 15:03  1849   STOP      1