I'm building a small counting application that basically tracks the total number of people inside a building. The data frame I pull from the access-controller database (the device that lets people in and out) contains a human error: sometimes a user badges an Exit before any Entry. So there are instances in the data frame where a card has one or more exits before its next corresponding entry. The df looks like this:
date timestamp type cardno status
**20201006 55737 PC010 117016056 Valid Card Exit**
20201006 55907 PC010 117016056 Valid Card Entry
20201006 60312 PC006 100024021 Valid Card Entry
20201006 61311 PC006 100024021 Valid Card Exit
20201006 61445 PC006 100024021 Valid Card Entry
20201006 61538 PC006 100024021 Valid Card Exit
20201006 61646 PC010 117016056 Valid Card Exit
20201006 61933 PC006 100024021 Valid Card Entry
20201006 61938 PC010 117016056 Valid Card Entry
20201006 62025 PC006 100024021 Valid Card Exit
20201006 62041 PC010 117016056 Valid Card Exit
20201006 62042 PC006 100024021 Valid Card Entry
20201006 62225 PC010 117016056 Valid Card Entry
20201006 62527 PC006 100024021 Valid Card Exit
20201006 63018 PC006 100024021 Valid Card Entry
20201006 64832 PC007 116057383 Valid Card Entry
20201006 64834 PC011 117016074 Valid Card Entry
**20201006 64952 PC012 116054003 Valid Card Exit**
The rows marked with ** are basically an employee badging an exit before any entry (for whatever reason), which messes up the count. I want to get rid of all such instances in the data frame, and I'm having a hard time doing it. The counting software I've written so far basically reads a Firebird database, pulls different data frames out of it, counts people from their shapes, and displays the output as simple HTML on a big screen inside the building. The problematic data frame described above is called "contractorDf" in the program I'm running in production (testing), shown below:
import subprocess
from datetime import datetime
from datetime import date
import pandas as pd
import re
import os
import sys
#------------------------------------------------------PRODUCTION-----------------------------------------#
# Generating a Temporary Date for Production Environment
tempDate = date(2020, 10, 6)
tempDate = str(tempDate)
tempDate = tempDate.replace('-', '')
#------------------------------------------------------PRODUCTION----------------------------------------#
################################################################################################################################
# Getting Current Day (This will be used in real environment)
currentDay = datetime.now().day
if currentDay < 10:
    currentDay = str(currentDay)
    currentDay = '0' + currentDay
else:
    currentDay = str(currentDay)
# Getting Current Year & Month
currentYear = datetime.now().year
currentMonth = datetime.now().month
currentYear = str(currentYear)
currentMonth = str(currentMonth)
currentYearMonth = currentYear+currentMonth
currentYearMonthDay = currentYearMonth+currentDay
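# Equivalent construction using strftime (a sketch; these padded variables are
# hypothetical and not used by the rest of the script). Unlike the concatenation
# above, strftime also zero-pads single-digit months, not only days.
paddedYearMonth = datetime.now().strftime('%Y%m')       # e.g. '202010'
paddedYearMonthDay = datetime.now().strftime('%Y%m%d')  # e.g. '20201006'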
# Getting Variable for After FROM
currentTableName = 'ST'+currentYearMonth
# Getting Final Query (not used right now because of testing)
query = "SELECT * FROM " + currentTableName + " " + "WHERE TRDATE=" + currentYearMonthDay + ";"  # use the ST-prefixed table name, matching tempTableName below
finalQuery = bytes(query, 'ascii')
#############################################################################################################################
#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Making a temporary Table Name and Query for Production Environment
tempTableName = 'ST'+currentYearMonth
nonByteQuery = "SELECT * FROM " + tempTableName + " " + "WHERE TRDATE=" + tempDate + ";"
tempQuery = bytes(nonByteQuery, 'ascii')
#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Generating record.csv file from command prompt (Before initiating this, C:\\Program Files (x86)\\FireBird\\FireBird_2_1\\bin should be in the environment variables)
p = subprocess.Popen('isql', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b'CONNECT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\TRANS.fdb";') # the b prefix is needed because stdin expects bytes, not str
p.stdin.write(b'OUTPUT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv";')
p.stdin.write(tempQuery)
p.stdin.write(b'OUTPUT;')
p.communicate()
# communicate() closes stdin and waits for isql to exit, so terminate() below is
# normally a no-op; it simply makes sure the command prompt window is gone.
p.terminate()
# Reading the record file that is just generated above
# sep='delimeter' never matches anything, so every line is read into a single column (column 0)
tempdf = pd.read_csv('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv', sep='delimeter', engine='python', header=None, skipinitialspace=True)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)
#tempdf = tempdf[0].astype(str)
columns = ["TRDATE", "TRTIME", "TRCODE", "TRDESC", "CTRLTAG", "CTRLNAME", "CTRLIP", "CARDNO", "STAFFNO", "STAFFNAME", "DEPTNAME", "JOBNAME", "SHIFTNAME", "DEVTYPE", "DEVNAME", "DEVNO", "TRID", "ISCAP", "RCGROUP", "POLLTIME", "SENDSEQ", "RECSEQ", "IOBNO", "IOBNAME", "ZONENO", "ZONENAME", "POINTNO", "POINTNAME", "ISSNAPRET", "PROTRAG"]
header = tempdf.iloc[0]
linespace = tempdf.iloc[1]
header = str(header)
header = header[5:]
header = header[:-24]
linespace = str(linespace)
linespace = linespace[7:]
linespace = linespace[:-23]
tempdf = tempdf[~tempdf[0].str.contains(header)]
tempdf = tempdf[~tempdf[0].str.contains(linespace)]
tempdf = tempdf[0].str.replace(' ', ',')
df = tempdf.str.split(",", n=400, expand=True)
df = df[[0,1,7,8,9,10,31,41,42,43,52,53,54]]
df[100] = df[7].map(str) + ' ' + df[8].map(str) + ' ' + df[9].map(str) + ' ' + df[10].map(str)
df = df.drop([7,8,9,10], axis=1)
df[101] = df[31].map(str) + df[41].map(str)
df = df.drop([31,41], axis=1)
df[102] = df[43].map(str) + df[52].map(str) + df[53].map(str) + df[54].map(str)
df = df.drop([43,52,53,54], axis=1)
def newblock(column):
    if column[42].startswith('VIS'):
        return column[42]
    else:
        pass
df = df.assign(newblock=df.apply(newblock, axis=1))
df[42] = df[42].str.replace(r'VIS_\d\d\d\d\d\d\d\d\d\d', '')
df[105] = df[42].map(str) + df[101].map(str)
df = df.drop([42,101], axis=1)
df[106] = df[102].map(str) + df['newblock'].map(str)
df = df.drop(['newblock', 102], axis=1)
df[106] = df[106].str.replace('None', '')
df = df[[0,1,106,105,100]]
columns = ['date', 'timestamp', 'type', 'cardno', 'status']
df.columns = df.columns.map(str)
df.columns = columns
df = df.reset_index()
df = df.drop(['index'], axis=1)
#Making Visitor Counter
visitorDf = df[df['type'].str.startswith('VIS')]
#visitorDf = visitorDf[~visitorDf['status'].str.contains('Unknown')]
visitorIn1 = len(visitorDf[visitorDf['status'].str.contains('Unknown')])
visitorIn1 = int(visitorIn1)
visitorDf = visitorDf.reset_index()
visitorDf = visitorDf.drop(('index'), axis=1)
visitorIn = len(visitorDf[visitorDf['status'].str.contains('Valid Card Entry')])
visitorOut = len(visitorDf[visitorDf['status'].str.contains('Valid Card Exit')])
visitorIn = int(visitorIn)
visitorOut = int(visitorOut)
totalVisitor = visitorIn1 + visitorIn - visitorOut
#Making Contractor Counter
contractorDf = df[df['type'].str.startswith('PC')]
#contractorDf = contractorDf[~contractorDf['status'].str.contains('Unknown')]
contractorIn1 = len(contractorDf[contractorDf['status'].str.contains('Unknown')])
contractorIn1 = int(contractorIn1)
contractorDf = contractorDf.reset_index()
contractorDf = contractorDf.drop(('index'), axis=1)
contractorIn = len(contractorDf[contractorDf['status'].str.contains('Valid Card Entry')])
contractorOut = len(contractorDf[contractorDf['status'].str.contains('Valid Card Exit')])
contractorIn = int(contractorIn)
contractorOut = int(contractorOut)
totalContractor = contractorIn1 + contractorIn - contractorOut
#Making Employee Counter
employeeDf = df[df['type'].str.contains(r'^\d', regex=True)]
#employeeDf = employeeDf[~employeeDf['status'].str.contains('Unknown')]
employeeIn1 = len(employeeDf[employeeDf['status'].str.contains('Unknown')])
employeeIn1 = int(employeeIn1)
employeeDf = employeeDf.reset_index()
employeeDf = employeeDf.drop(('index'), axis=1)
employeeIn = len(employeeDf[employeeDf['status'].str.contains('Valid Card Entry')])
employeeOut = len(employeeDf[employeeDf['status'].str.contains('Valid Card Exit')])
employeeIn = int(employeeIn)
employeeOut = int(employeeOut)
totalEmployee = employeeIn1 + employeeIn - employeeOut
os.remove('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv')
visitor = totalVisitor
employee = totalEmployee
contractor = totalContractor
if os.path.exists('C:\\Apache24\\htdocs\\counter\\index.html'):
    os.remove('C:\\Apache24\\htdocs\\counter\\index.html')
else:
    pass
f = open('C:\\Apache24\\htdocs\\counter\\index.html', 'w')
message = """
<html lang="en-US" class="hide-scroll">
<head>
<title>Emhart Counter</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
<style>
body {{
background-color: lightblue;
}}
.verticalCenter {{
margin: 0;
top: 100%;
-ms-transform: translateY(25%);
transform: translateY(25%);
}}
</style>
</head>
<body>
<center>
<div class="verticalCenter">
<h1 style=font-size:100px>VISITORS:        {visitor}</h1><br></br><br></br>
<h1 style=font-size:100px>EMPLOYEES:        {employee}</h1><br></br><br></br>
<h1 style=font-size:100px>CONTRACTORS:        {contractor}</h1><br></br><br></br><br></br><br></br>
<h3 style="font-size:50px">THIS IS A TEST RUN</h3>
</div>
</center>
</body>
</html>"""
new_message = message.format(visitor=visitor, employee=employee, contractor=contractor)
f.write(new_message)
f.close()
sys.exit()
The only remaining problem is how to get rid of those exits that occur before a corresponding entry for a given cardno/type in contractorDf. I would really appreciate any help with this.
Answer 0 (score: 0)
startswith and endswith would work here, for example. For more complex regex patterns, use contains.
mask = df.date.str.startswith("**")
print(df[mask])
# or
mask = df.status.str.endswith("**")
print(df[mask])
Output:
date timestamp type cardno status
0 **20201006 55737 PC010 117016056 Valid_Card_Exit**
3 **20201006 64952 PC012 116054003 Valid_Card_Exit**
Setup:
columns = ['date','timestamp','type','cardno','status']
data = [el.split(",") for el in ['**20201006,55737,PC010,117016056,Valid_Card_Exit**',
'20201006,55907,PC010,117016056,Valid_Card_Entry',
'20201006,64834,PC011,117016074,Valid_Card_Entry',
'**20201006,64952,PC012,116054003,Valid_Card_Exit**']]
df = pd.DataFrame(data, columns=columns)
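If the goal is to drop those rows rather than just view them, the inverted mask can be applied afterwards (a minimal sketch, assuming the same df and mask as above):

df = df[~mask].reset_index(drop=True)
print(df)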
Answer 1 (score: 0)
The key to this problem is a common arithmetic trick. Treat each entry as 1 and each exit as a cancellation of an entry, i.e. -1. An exit event is then bad if it is the first one to drive the cumulative sum (cumsum) for that card negative, i.e. it cannot be interpreted as a proper cancellation of an earlier entry. Note, however, that later negative cumulative values may simply be caused by an earlier bad value, so only the first negative cumulative value per card is flagged as bad. Based on this observation, the first bad row for each card can be found pass by pass until no negative cumulative values remain. The short illustration right after this paragraph shows the idea on one card from the question; the implementation that follows does this recursively. It has not been optimized for large datasets, but the gist should be the same.
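As an illustration of the trick on a single card (the PC010 / 117016056 rows from the question), the deltas and their cumulative sum look like this:

import pandas as pd

# Card 117016056 badges: Exit, Entry, Exit, Entry, Exit, Entry
deltas = pd.Series([-1, 1, -1, 1, -1, 1],
                   index=[55737, 55907, 61646, 61938, 62041, 62225])
print(deltas.cumsum())
# 55737   -1   <- first negative cumsum: the bad exit
# 55907    0
# 61646   -1   <- also negative, but only because of the earlier bad row
# 61938    0
# 62041   -1
# 62225    0
# After dropping the 55737 row and recomputing, no negative values remain.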
# initialize
df["retain"] = True
df["delta"] = -1
df.loc[df["status"] == "Valid Card Entry", "delta"] = 1
def recurse(df):
    # sort for cumsum (bad values already found are not retained)
    df_sorted = df[df["retain"]].sort_values(by=["cardno", "timestamp"]).reset_index(drop=True)
    # cumsum of the +1/-1 deltas per card
    df_sorted["cumsum"] = df_sorted[["cardno", "delta"]].groupby("cardno").cumsum()
    # get the first occurrence of a negative cumsum for each card
    df_dup = df_sorted[df_sorted["cumsum"] < 0].groupby("cardno").first()
    # termination condition: no more bad values were found
    if len(df_dup) == 0:
        return
    # else, flag the bad rows ...
    for cardno, row in df_dup.iterrows():
        df.loc[(df["cardno"] == cardno) & (df["timestamp"] == row["timestamp"]), "retain"] = False
    # ... and recurse until no negative cumulative sums remain
    recurse(df)

# execute
recurse(df)
del df["delta"]  # optional cleanup
See the "retain" column (False = a bad exit).
df
Out[61]:
date timestamp type cardno status retain
0 20201006 55737 PC010 117016056 Valid Card Exit False
1 20201006 55907 PC010 117016056 Valid Card Entry True
2 20201006 60312 PC006 100024021 Valid Card Entry True
3 20201006 61311 PC006 100024021 Valid Card Exit True
4 20201006 61445 PC006 100024021 Valid Card Entry True
5 20201006 61538 PC006 100024021 Valid Card Exit True
6 20201006 61646 PC010 117016056 Valid Card Exit True
7 20201006 61933 PC006 100024021 Valid Card Entry True
8 20201006 61938 PC010 117016056 Valid Card Entry True
9 20201006 62025 PC006 100024021 Valid Card Exit True
10 20201006 62041 PC010 117016056 Valid Card Exit True
11 20201006 62042 PC006 100024021 Valid Card Entry True
12 20201006 62225 PC010 117016056 Valid Card Entry True
13 20201006 62527 PC006 100024021 Valid Card Exit True
14 20201006 63018 PC006 100024021 Valid Card Entry True
15 20201006 64832 PC007 116057383 Valid Card Entry True
16 20201006 64834 PC011 117016074 Valid Card Entry True
17 20201006 64952 PC012 116054003 Valid Card Exit False
For demonstration purposes, the cumsums before and after the cleanup are shown below. The dataset is sorted by (cardno, timestamp), and the date column is dropped for clarity.
Before
df_sorted
Out[69]:
timestamp type cardno status retain delta cumsum
0 60312 PC006 100024021 Valid Card Entry True 1 1
1 61311 PC006 100024021 Valid Card Exit True -1 0
2 61445 PC006 100024021 Valid Card Entry True 1 1
3 61538 PC006 100024021 Valid Card Exit True -1 0
4 61933 PC006 100024021 Valid Card Entry True 1 1
5 62025 PC006 100024021 Valid Card Exit True -1 0
6 62042 PC006 100024021 Valid Card Entry True 1 1
7 62527 PC006 100024021 Valid Card Exit True -1 0
8 63018 PC006 100024021 Valid Card Entry True 1 1
9 64952 PC012 116054003 Valid Card Exit True -1 -1
10 64832 PC007 116057383 Valid Card Entry True 1 1
11 55737 PC010 117016056 Valid Card Exit True -1 -1
12 55907 PC010 117016056 Valid Card Entry True 1 0
13 61646 PC010 117016056 Valid Card Exit True -1 -1
14 61938 PC010 117016056 Valid Card Entry True 1 0
15 62041 PC010 117016056 Valid Card Exit True -1 -1
16 62225 PC010 117016056 Valid Card Entry True 1 0
17 64834 PC011 117016074 Valid Card Entry True 1 1
After
df_sorted
Out[73]:
timestamp type cardno status retain delta cumsum
0 60312 PC006 100024021 Valid Card Entry True 1 1
1 61311 PC006 100024021 Valid Card Exit True -1 0
2 61445 PC006 100024021 Valid Card Entry True 1 1
3 61538 PC006 100024021 Valid Card Exit True -1 0
4 61933 PC006 100024021 Valid Card Entry True 1 1
5 62025 PC006 100024021 Valid Card Exit True -1 0
6 62042 PC006 100024021 Valid Card Entry True 1 1
7 62527 PC006 100024021 Valid Card Exit True -1 0
8 63018 PC006 100024021 Valid Card Entry True 1 1
9 64832 PC007 116057383 Valid Card Entry True 1 1
10 55907 PC010 117016056 Valid Card Entry True 1 1
11 61646 PC010 117016056 Valid Card Exit True -1 0
12 61938 PC010 117016056 Valid Card Entry True 1 1
13 62041 PC010 117016056 Valid Card Exit True -1 0
14 62225 PC010 117016056 Valid Card Entry True 1 1
15 64834 PC011 117016074 Valid Card Entry True 1 1
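As a follow-up usage sketch (assuming the df and retain column from above, plus the column names and counting formula from the question), the flagged rows can be dropped and the contractor counter recomputed on the cleaned frame:

# Drop the flagged bad exits, then recount (column names taken from the question).
cleanDf = df[df["retain"]].drop(columns=["retain"])
contractorDf = cleanDf[cleanDf["type"].str.startswith("PC")]
contractorIn1 = contractorDf["status"].str.contains("Unknown").sum()
contractorIn = (contractorDf["status"] == "Valid Card Entry").sum()
contractorOut = (contractorDf["status"] == "Valid Card Exit").sum()
totalContractor = contractorIn1 + contractorIn - contractorOut
print(totalContractor)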