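How to remove duplicates from a pandas DataFrame based on a condition

Time: 2020-10-12 07:47:50

Tags: python pandas

I am building a small counting program that counts the total number of people inside a premises. The DataFrame I pull from the microcontroller database (the controllers that let people in and out) contains a human error: sometimes a user exits before making an entry. As a result, there are cases in the DataFrame where a card has more than one exit before its next entry. The df looks like this: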

date     timestamp  type    cardno      status
**20201006  55737   PC010   117016056   Valid Card Exit**
20201006    55907   PC010   117016056   Valid Card Entry
20201006    60312   PC006   100024021   Valid Card Entry
20201006    61311   PC006   100024021   Valid Card Exit
20201006    61445   PC006   100024021   Valid Card Entry
20201006    61538   PC006   100024021   Valid Card Exit
20201006    61646   PC010   117016056   Valid Card Exit
20201006    61933   PC006   100024021   Valid Card Entry
20201006    61938   PC010   117016056   Valid Card Entry
20201006    62025   PC006   100024021   Valid Card Exit
20201006    62041   PC010   117016056   Valid Card Exit
20201006    62042   PC006   100024021   Valid Card Entry
20201006    62225   PC010   117016056   Valid Card Entry
20201006    62527   PC006   100024021   Valid Card Exit
20201006    63018   PC006   100024021   Valid Card Entry
20201006    64832   PC007   116057383   Valid Card Entry
20201006    64834   PC011   117016074   Valid Card Entry
**20201006  64952   PC012   116054003   Valid Card Exit**

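The rows marked with ** are cases where an employee made an exit before any entry (for whatever reason), which throws the count off. I want to get rid of all such rows in the DataFrame, and I am having a hard time doing it. The counting program I have so far reads a Firebird database, derives several DataFrames from it, counts their rows, and shows the result as simple HTML on a big screen inside the premises. The problematic DataFrame described above is called "contractorDf" in the program I am running in production (testing), shown below: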

import subprocess
from datetime import datetime
from datetime import date
import pandas as pd
import re
import os
import sys
   
#------------------------------------------------------PRODUCTION-----------------------------------------#
# Generating a Temporary Date for Production Environment
tempDate = date(2020, 10, 6)
tempDate = str(tempDate)
tempDate = tempDate.replace('-', '')
#------------------------------------------------------PRODUCTION----------------------------------------#
   
################################################################################################################################
# Getting Current Day (This will be used in real environment)
now = datetime.now()
currentDay = now.strftime('%d')  # zero-padded day, e.g. '06'


# Getting Current Year & Month
currentYear = now.strftime('%Y')
currentMonth = now.strftime('%m')  # zero-padded month, so the date reads YYYYMMDD like tempDate
currentYearMonth = currentYear+currentMonth
currentYearMonthDay = currentYearMonth+currentDay

# Getting Variable for After FROM
currentTableName = 'ST'+currentYearMonth

# Getting Final Query (Commented Right now because Testing)
query = "SELECT * FROM " + currentTableName + " " + "WHERE TRDATE=" + currentYearMonthDay + ";"
finalQuery = bytes(query, 'ascii')
#############################################################################################################################


#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Making a temporary Table Name and Query for Production Environment
tempTableName = 'ST'+currentYearMonth
nonByteQuery = "SELECT * FROM " + tempTableName + " " + "WHERE TRDATE=" + tempDate + ";"
tempQuery = bytes(nonByteQuery, 'ascii')
#-------------------------------------------------------PRODUCTION------------------------------------------------------#



# Generating record.csv file from command prompt (Before initiating this, C:\\Program Files (x86)\\FireBird\\FireBird_2_1\\bin should be in the environment variables)
p = subprocess.Popen('isql', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b'CONNECT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\TRANS.fdb";') #The italicized b is because its a Byte size code and we can't 
p.stdin.write(b'OUTPUT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv";')
p.stdin.write(tempQuery)
p.stdin.write(b'OUTPUT;')
p.communicate()
p.terminate()
# Terminating the Command Prompt Window



# Reading the record file that is just generated above
tempdf = pd.read_csv('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv', sep='delimeter', engine='python', header=None, skipinitialspace=True)
pd.set_option('display.max_colwidth', None)  # None means "no limit"; -1 is deprecated
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)

#tempdf = tempdf[0].astype(str)
columns = ["TRDATE", "TRTIME", "TRCODE", "TRDESC", "CTRLTAG", "CTRLNAME", "CTRLIP", "CARDNO", "STAFFNO", "STAFFNAME", "DEPTNAME", "JOBNAME", "SHIFTNAME", "DEVTYPE", "DEVNAME", "DEVNO", "TRID", "ISCAP", "RCGROUP", "POLLTIME", "SENDSEQ", "RECSEQ", "IOBNO", "IOBNAME", "ZONENO", "ZONENAME", "POINTNO", "POINTNAME", "ISSNAPRET", "PROTRAG"]
header = tempdf.iloc[0]
linespace = tempdf.iloc[1]
header = str(header)
header = header[5:]
header = header[:-24]
linespace = str(linespace)
linespace = linespace[7:]
linespace = linespace[:-23]

tempdf = tempdf[~tempdf[0].str.contains(header)]
tempdf = tempdf[~tempdf[0].str.contains(linespace)]
tempdf = tempdf[0].str.replace(' ', ',')
df = tempdf.str.split(",", n=400, expand=True)
df = df[[0,1,7,8,9,10,31,41,42,43,52,53,54]]
df[100] = df[7].map(str) + ' ' + df[8].map(str) + ' ' + df[9].map(str) + ' ' + df[10].map(str)
df = df.drop([7,8,9,10], axis=1)
df[101] = df[31].map(str) + df[41].map(str)
df = df.drop([31,41], axis=1)
df[102] = df[43].map(str) + df[52].map(str) + df[53].map(str) + df[54].map(str)
df = df.drop([43,52,53,54], axis=1)

def newblock(column):
    if column[42].startswith('VIS'):
        return column[42]
    else:
        pass


df = df.assign(newblock=df.apply(newblock, axis=1))

df[42] = df[42].str.replace(r'VIS_\d{10}', '', regex=True)

df[105] = df[42].map(str) + df[101].map(str)
df = df.drop([42,101], axis=1)
df[106] = df[102].map(str) + df['newblock'].map(str)
df = df.drop(['newblock', 102], axis=1)
df[106] = df[106].str.replace('None', '')
df = df[[0,1,106,105,100]]
columns = ['date', 'timestamp', 'type', 'cardno', 'status']
df.columns = df.columns.map(str)
df.columns = columns
df = df.reset_index()
df = df.drop(['index'], axis=1)




#Making Visitor Counter
visitorDf = df[df['type'].str.startswith('VIS')]
#visitorDf = visitorDf[~visitorDf['status'].str.contains('Unknown')]
visitorIn1 = len(visitorDf[visitorDf['status'].str.contains('Unknown')])
visitorIn1 = int(visitorIn1)
visitorDf = visitorDf.reset_index()
visitorDf = visitorDf.drop(('index'), axis=1)
visitorIn = len(visitorDf[visitorDf['status'].str.contains('Valid Card Entry')])
visitorOut = len(visitorDf[visitorDf['status'].str.contains('Valid Card Exit')])
visitorIn = int(visitorIn)
visitorOut = int(visitorOut)
totalVisitor = visitorIn1 + visitorIn - visitorOut

#Making Contractor Counter
contractorDf = df[df['type'].str.startswith('PC')]
#contractorDf = contractorDf[~contractorDf['status'].str.contains('Unknown')]
contractorIn1 = len(contractorDf[contractorDf['status'].str.contains('Unknown')])
contractorIn1 = int(contractorIn1)
contractorDf = contractorDf.reset_index()
contractorDf = contractorDf.drop(('index'), axis=1)
contractorIn = len(contractorDf[contractorDf['status'].str.contains('Valid Card Entry')])
contractorOut = len(contractorDf[contractorDf['status'].str.contains('Valid Card Exit')])
contractorIn = int(contractorIn)
contractorOut = int(contractorOut)
totalContractor = contractorIn1 + contractorIn - contractorOut


#Making Employee Counter
employeeDf = df[df['type'].str.contains(r'^\d', regex=True)]
#employeeDf = employeeDf[~employeeDf['status'].str.contains('Unknown')]
employeeIn1 = len(employeeDf[employeeDf['status'].str.contains('Unknown')])
employeeIn1 = int(employeeIn1)
employeeDf = employeeDf.reset_index()
employeeDf = employeeDf.drop(('index'), axis=1)
employeeIn = len(employeeDf[employeeDf['status'].str.contains('Valid Card Entry')])
employeeOut = len(employeeDf[employeeDf['status'].str.contains('Valid Card Exit')])
employeeIn = int(employeeIn)
employeeOut = int(employeeOut)
totalEmployee = employeeIn1 + employeeIn - employeeOut


os.remove('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv')

visitor = totalVisitor
employee = totalEmployee
contractor = totalContractor

if os.path.exists('C:\\Apache24\\htdocs\\counter\\index.html'):
    os.remove('c:\\Apache24\\htdocs\\counter\\index.html')
else:
    pass

f = open('C:\\Apache24\\htdocs\\counter\\index.html', 'w')

message = """
<html lang="en-US" class="hide-scroll">
    <head>
        <title>Emhart Counter</title>
        <meta charset="utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
        <style>
        body {{
            background-color: lightblue;
        }}

        .verticalCenter {{
            margin: 0;
            top: 100%;
            -ms-transform: translateY(25%);
            transform: translateY(25%);
        }}
        </style>
    </head>
    <body>
        <center>
            <div class="verticalCenter">
                <h1 style=font-size:100px>VISITORS: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {visitor}</h1><br></br><br></br>
                <h1 style=font-size:100px>EMPLOYEES: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {employee}</h1><br></br><br></br>
                <h1 style=font-size:100px>CONTRACTORS: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {contractor}</h1><br></br><br></br><br></br><br></br>
                <h3 style="font-size: 50px">THIS IS A TEST RUN</h3>
            </div>
        </center>
    </body>
</html>"""


new_message = message.format(visitor=visitor, employee=employee, contractor=contractor)
f.write(new_message)
f.close()


sys.exit()

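The only remaining problem is how to get rid of exits that come before a corresponding entry for the same cardno / type in contractorDf / df. I would really appreciate any help with this.

2 answers:

Answer 0 (score: 0)

For example, startswith and endswith would work. For more complex regex patterns, use contains.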

mask = df.date.str.startswith("**")
print(df[mask])

# or

mask = df.status.str.endswith("**")
print(df[mask])

Output:

         date timestamp   type     cardno             status
0  **20201006     55737  PC010  117016056  Valid_Card_Exit**
3  **20201006     64952  PC012  116054003  Valid_Card_Exit**

Setup:

columns = ['date','timestamp','type','cardno','status']
data = [el.split(",") for el in ['**20201006,55737,PC010,117016056,Valid_Card_Exit**',
'20201006,55907,PC010,117016056,Valid_Card_Entry',
'20201006,64834,PC011,117016074,Valid_Card_Entry',
'**20201006,64952,PC012,116054003,Valid_Card_Exit**']]
df = pd.DataFrame(data, columns=columns)
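
The answer mentions contains for regex patterns without showing it. A minimal sketch, assuming the same toy rows as the setup above and assuming the goal is to drop the marked rows rather than just print them (the names `flagged` and `cleaned` are illustrative):

```python
import pandas as pd

# Same toy rows as the setup above; the literal "**" markers are part of the strings.
columns = ["date", "timestamp", "type", "cardno", "status"]
rows = [
    "**20201006,55737,PC010,117016056,Valid_Card_Exit**",
    "20201006,55907,PC010,117016056,Valid_Card_Entry",
    "20201006,64834,PC011,117016074,Valid_Card_Entry",
    "**20201006,64952,PC012,116054003,Valid_Card_Exit**",
]
df = pd.DataFrame([r.split(",") for r in rows], columns=columns)

# "*" is a regex metacharacter, so it has to be escaped inside the pattern.
mask = df["status"].str.contains(r"Exit\*\*$", regex=True)

flagged = df[mask]    # rows carrying the marker
cleaned = df[~mask]   # everything else
print(cleaned)
```

Note that this only works when the markers are literally present in the data; in the question they were added by hand to point out the bad rows.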

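Answer 1 (score: 0)

The cumsum trick

The key to this problem is a common mathematical trick. We first treat an entry as 1 and an exit as the cancellation of an entry, i.e. -1. An exit event is then bad if it is the first row to produce a negative cumulative sum (cumsum), meaning the exit cannot be interpreted as properly cancelling an earlier entry. Note, however, that later negative cumsum values may be caused by an earlier bad row. Therefore, only the first negative cumulative value for each card is identified as erroneous.

Based on this observation, the first bad row of each card can be found recursively, until no negative cumulative value is produced.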
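
To make the idea concrete for a single card, here is a small plain-Python sketch. It is not the answer's implementation: it assumes the events of one card are already in chronological order and replaces the repeated search with one pass that never lets the running total drop below zero, which should flag the same rows. The function name `flag_unmatched_exits` is made up for illustration.

```python
def flag_unmatched_exits(events):
    """Return one bool per event: True to keep, False for an exit with no open entry.

    `events` holds one card's statuses in chronological order.
    Anything that is not an entry is treated as an exit (-1).
    """
    inside = 0  # running total: entries minus matched exits
    keep = []
    for status in events:
        if status == "Valid Card Entry":
            inside += 1
            keep.append(True)
        elif inside > 0:
            inside -= 1  # this exit cancels an earlier entry
            keep.append(True)
        else:
            keep.append(False)  # the total would go negative: bad exit
    return keep


# Card 117016056 from the question, in timestamp order.
card = ["Valid Card Exit", "Valid Card Entry", "Valid Card Exit",
        "Valid Card Entry", "Valid Card Exit", "Valid Card Entry"]
print(flag_unmatched_exits(card))
# [False, True, True, True, True, True]
```

Only the very first exit is flagged; the later exits each cancel an entry once that first row is ignored.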

Code

This implementation demonstrates how to do it recursively. It is not optimized for large datasets, but the idea should carry over.

# initialize
df["retain"] = True
df["delta"] = -1
df.loc[df["status"] == "Valid Card Entry", "delta"] = 1

def recurse(df):

    # sort for cumsum (rows already flagged as bad are excluded)
    df_sorted = df[df["retain"]].sort_values(by=["cardno", "timestamp"]).reset_index(drop=True)

    # cumsum
    df_sorted["cumsum"] = df_sorted[["cardno", "delta"]].groupby("cardno").cumsum()

    # get the first occurrence of negative cumsum
    df_dup = df_sorted[df_sorted["cumsum"] < 0].groupby("cardno").first()

    # termination condition: no more bad values were found
    if len(df_dup) == 0:
        return

    # else, remove the bad rows
    for cardno, row in df_dup.iterrows():
        df.loc[(df["cardno"] == cardno) & (df["timestamp"] == row["timestamp"]), "retain"] = False

    # repeat: a card may still have further bad rows after its first one is removed
    recurse(df)

# execute    
recurse(df)

del df["delta"]  # optional cleanup
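
For larger datasets, the repeated search could probably be replaced by a single vectorized pass. The sketch below is an assumption of mine rather than part of the answer: a row is flagged when it pushes a card's running total below every earlier low (with the floor starting at 0). The helper name `flag_bad_exits` and the toy frame are hypothetical.

```python
import pandas as pd


def flag_bad_exits(df):
    """Add a boolean `retain` column without recursion (hypothetical helper)."""
    out = df.sort_values(["cardno", "timestamp"]).copy()
    # +1 for an entry, -1 for everything else, like the answer's `delta` column
    delta = out["status"].eq("Valid Card Entry").map({True: 1, False: -1})
    running = delta.groupby(out["cardno"]).cumsum()
    # lowest running total seen on earlier rows of the same card, capped at 0
    prev_low = (
        running.groupby(out["cardno"])
        .transform(lambda s: s.cummin().shift(fill_value=0))
        .clip(upper=0)
    )
    # a row is bad when it pushes the total below every earlier low
    out["retain"] = running >= prev_low
    return out.sort_index()


# A few rows taken from the question's data.
sample = pd.DataFrame({
    "timestamp": [55737, 55907, 60312, 61311, 61646, 64952],
    "cardno": [117016056, 117016056, 100024021, 100024021, 117016056, 116054003],
    "status": ["Valid Card Exit", "Valid Card Entry", "Valid Card Entry",
               "Valid Card Exit", "Valid Card Exit", "Valid Card Exit"],
})
result = flag_bad_exits(sample)
print(result)
```

On this sample the rows with timestamps 55737 and 64952 get retain = False, the same rows the recursive version marks.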

Output

See the "retain" column (False = erroneous exit).

df
Out[61]: 
        date  timestamp   type     cardno            status  retain
0   20201006      55737  PC010  117016056   Valid Card Exit   False
1   20201006      55907  PC010  117016056  Valid Card Entry    True
2   20201006      60312  PC006  100024021  Valid Card Entry    True
3   20201006      61311  PC006  100024021   Valid Card Exit    True
4   20201006      61445  PC006  100024021  Valid Card Entry    True
5   20201006      61538  PC006  100024021   Valid Card Exit    True
6   20201006      61646  PC010  117016056   Valid Card Exit    True
7   20201006      61933  PC006  100024021  Valid Card Entry    True
8   20201006      61938  PC010  117016056  Valid Card Entry    True
9   20201006      62025  PC006  100024021   Valid Card Exit    True
10  20201006      62041  PC010  117016056   Valid Card Exit    True
11  20201006      62042  PC006  100024021  Valid Card Entry    True
12  20201006      62225  PC010  117016056  Valid Card Entry    True
13  20201006      62527  PC006  100024021   Valid Card Exit    True
14  20201006      63018  PC006  100024021  Valid Card Entry    True
15  20201006      64832  PC007  116057383  Valid Card Entry    True
16  20201006      64834  PC011  117016074  Valid Card Entry    True
17  20201006      64952  PC012  116054003   Valid Card Exit   False

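For demonstration purposes, the cumsums before and after the cleanup are shown below. The dataset is sorted by (cardno, timestamp), and the date column is dropped for clarity.

Before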

df_sorted
Out[69]: 
    timestamp   type     cardno            status  retain  delta  cumsum
0       60312  PC006  100024021  Valid Card Entry    True      1       1
1       61311  PC006  100024021   Valid Card Exit    True     -1       0
2       61445  PC006  100024021  Valid Card Entry    True      1       1
3       61538  PC006  100024021   Valid Card Exit    True     -1       0
4       61933  PC006  100024021  Valid Card Entry    True      1       1
5       62025  PC006  100024021   Valid Card Exit    True     -1       0
6       62042  PC006  100024021  Valid Card Entry    True      1       1
7       62527  PC006  100024021   Valid Card Exit    True     -1       0
8       63018  PC006  100024021  Valid Card Entry    True      1       1
9       64952  PC012  116054003   Valid Card Exit    True     -1      -1
10      64832  PC007  116057383  Valid Card Entry    True      1       1
11      55737  PC010  117016056   Valid Card Exit    True     -1      -1
12      55907  PC010  117016056  Valid Card Entry    True      1       0
13      61646  PC010  117016056   Valid Card Exit    True     -1      -1
14      61938  PC010  117016056  Valid Card Entry    True      1       0
15      62041  PC010  117016056   Valid Card Exit    True     -1      -1
16      62225  PC010  117016056  Valid Card Entry    True      1       0
17      64834  PC011  117016074  Valid Card Entry    True      1       1

After

df_sorted
Out[73]: 
    timestamp   type     cardno            status  retain  delta  cumsum
0       60312  PC006  100024021  Valid Card Entry    True      1       1
1       61311  PC006  100024021   Valid Card Exit    True     -1       0
2       61445  PC006  100024021  Valid Card Entry    True      1       1
3       61538  PC006  100024021   Valid Card Exit    True     -1       0
4       61933  PC006  100024021  Valid Card Entry    True      1       1
5       62025  PC006  100024021   Valid Card Exit    True     -1       0
6       62042  PC006  100024021  Valid Card Entry    True      1       1
7       62527  PC006  100024021   Valid Card Exit    True     -1       0
8       63018  PC006  100024021  Valid Card Entry    True      1       1
9       64832  PC007  116057383  Valid Card Entry    True      1       1
10      55907  PC010  117016056  Valid Card Entry    True      1       1
11      61646  PC010  117016056   Valid Card Exit    True     -1       0
12      61938  PC010  117016056  Valid Card Entry    True      1       1
13      62041  PC010  117016056   Valid Card Exit    True     -1       0
14      62225  PC010  117016056  Valid Card Entry    True      1       1
15      64834  PC011  117016074  Valid Card Entry    True      1       1
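
Once the bad exits are flagged, the head count from the question can be taken over the retained rows only. A minimal sketch, assuming a frame that already has the retain column; the toy values are illustrative and the 'Unknown' statuses that the question's counters also add are left out:

```python
import pandas as pd

# Toy frame with the `retain` flag already computed (values are illustrative).
df = pd.DataFrame({
    "cardno": [117016056, 117016056, 100024021, 116054003],
    "status": ["Valid Card Exit", "Valid Card Entry",
               "Valid Card Entry", "Valid Card Exit"],
    "retain": [False, True, True, False],
})

clean = df[df["retain"]]  # drop the unmatched exits
entries = (clean["status"] == "Valid Card Entry").sum()
exits = (clean["status"] == "Valid Card Exit").sum()
inside = int(entries - exits)  # people currently inside
print(inside)
```

Without the filter the two unmatched exits would cancel the two entries and the count would read 0 instead of 2.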