Given:
The goal is to count the male employees for each department-building_type pair in each 30-minute segment.
The CSV data to process is:
time,department,building_type,gender
2017-09-07 14:46:14,018,management,b,m
2017-09-07 14:49:14,081,it,a,m
2017-09-07 14:55:14,127,management,c,f
2017-09-07 15:40:16,318,marketing,c,m
2017-09-07 16:01:14,018,it,a,m
2017-09-07 16:10:14,081,it,a,m
2017-09-07 17:46:14,127,marketing,c,m
2017-09-07 17:49:16,318,management,c,m
2017-09-07 18:00:14,018,it,c,f
2017-09-07 18:02:14,081,management,a,m
2017-09-07 18:33:14,127,marketing,b,m
2017-09-07 18:56:16,318,marketing,a,m
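The overall time span to process runs from 2017-09-07 14:46:14,018 to 2017-09-07 18:56:16,318.
Within this span, 30-minute segments should be defined, and the number of male employees for each department-building_type pair should be counted for each segment.
The output should contain a start_time column giving the start of the 30-minute segment in which the male employees of each department-building_type pair were counted.
The output should be shown in the terminal (CSV format is not required).
Example output: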
start_time,department,building_type,num_of_m_employees
2017-09-07 14:46:14,018,management,b,2
2017-09-07 14:46:14,018,it,a,1
2017-09-07 15:40:16,318,marketing,c,1
2017-09-07 15:40:16,318,it,a,2
2017-09-07 17:46:14,127,marketing,c,1
2017-09-07 17:46:14,127,management,a,1
2017-09-07 18:33:14,127,marketing,b,1
2017-09-07 18:33:14,127,marketing,a,1
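I wrote a program that counts the male employees for each department-building pair, but I can't work out how to do the same for each 30-minute segment. How can I edit it?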
import csv
from collections import Counter

with open('test.csv') as f:
    cnt = Counter()
    reader = csv.reader(f)
    for row in reader:
        if row[3] == "m":
            cnt[row[2], row[3]] += 1
print(cnt)
Answer 0 (score: 0)
This should hopefully get you started:
import csv
from collections import Counter
from datetime import datetime, timedelta

with open('test.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)  # skip the header row
    start_time = None
    thirty_mins = timedelta(minutes=30)
    cnt = Counter()

    for row in csv_input:
        # Convert the first two fields (time and milliseconds) into a datetime object
        dt = datetime.strptime("{} {:06}".format(row[0], int(row[1]) * 1000), '%Y-%m-%d %H:%M:%S %f')

        if start_time is None:
            start_time = dt

        # Past the current boundary: print its counts, then move the boundary on
        if dt >= start_time + thirty_mins:
            for (dept, building_type), count in cnt.items():
                print('{} {:03},{},{},{}'.format(start_time.strftime('%Y-%m-%d %H:%M:%S'), start_time.microsecond//1000, dept, building_type, count))
            start_time += thirty_mins
            cnt = Counter()

        if row[4] == "m":
            cnt[row[2], row[3]] += 1

    # Print whatever is left for the last boundary
    for (dept, building_type), count in cnt.items():
        print('{} {:03},{},{},{}'.format(start_time.strftime('%Y-%m-%d %H:%M:%S'), start_time.microsecond//1000, dept, building_type, count))
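The idea is to convert the time into a datetime object. With that, you can determine whether a row falls past the next 30-minute boundary. Your second column appears to contain milliseconds. The datetime format uses microseconds, so the value needs to be converted and appended.
Each row is read and its time converted. Next, determine whether we have passed the 30-minute boundary. If so, display the counter values for that boundary, reset the counter, and advance the time boundary by 30 minutes. Then, if the row is for a male employee, add it to the counter.
Finally, print the remaining entries for the last boundary.
For the example you gave, this would give you: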
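As a small illustration of the millisecond handling, here is a sketch that assumes the date-time and millisecond fields are rejoined with the original comma. The helper name `parse_timestamp` is made up for this example. It relies on the `%f` directive of `strptime`, which accepts one to six digits and zero-pads on the right, so `018` is read as 18000 microseconds:

```python
from datetime import datetime


def parse_timestamp(date_time, millis):
    """Combine the two CSV fields into one datetime (hypothetical helper)."""
    # "%f" accepts 1-6 digits and pads on the right: "018" -> 018000 microseconds
    return datetime.strptime(f"{date_time},{millis}", "%Y-%m-%d %H:%M:%S,%f")


dt = parse_timestamp("2017-09-07 14:46:14", "018")
print(dt.microsecond)  # 18000
```

This should be equivalent to multiplying the millisecond value by 1000 and formatting it to six digits before parsing.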
2017-09-07 14:46:14 018,management,b,1
2017-09-07 14:46:14 018,it,a,1
2017-09-07 15:16:14 018,marketing,c,1
2017-09-07 15:46:14 018,it,a,2
2017-09-07 16:16:14 018,marketing,c,1
2017-09-07 16:46:14 018,management,c,1
2017-09-07 17:46:14 018,management,a,1
2017-09-07 18:16:14 018,marketing,b,1
2017-09-07 18:46:14 018,marketing,a,1
Note that some boundaries contain no entries.
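One thing to watch for: the loop above moves the boundary forward by only one 30-minute step per row, so after a gap of more than 30 minutes a row can be reported under a window that starts well before its own timestamp (in the output above, the 17:46:14,127 marketing,c row appears under 16:16:14 018). If that is not what you want, one possible alternative is to compute each row's window directly from its offset to the first timestamp. The following is only a sketch under that assumption. The function name `count_males_per_window` and the inline `SAMPLE` data are for illustration only:

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta

# Sample data copied from the question; each timestamp spans two CSV fields
SAMPLE = """\
time,department,building_type,gender
2017-09-07 14:46:14,018,management,b,m
2017-09-07 14:49:14,081,it,a,m
2017-09-07 14:55:14,127,management,c,f
2017-09-07 15:40:16,318,marketing,c,m
2017-09-07 16:01:14,018,it,a,m
2017-09-07 16:10:14,081,it,a,m
2017-09-07 17:46:14,127,marketing,c,m
2017-09-07 17:49:16,318,management,c,m
2017-09-07 18:00:14,018,it,c,f
2017-09-07 18:02:14,081,management,a,m
2017-09-07 18:33:14,127,marketing,b,m
2017-09-07 18:56:16,318,marketing,a,m
"""


def count_males_per_window(lines, window=timedelta(minutes=30)):
    """Count male rows per (window start, department, building_type)."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    first = None
    counts = Counter()
    for date_time, millis, dept, building, gender in reader:
        dt = datetime.strptime(f"{date_time},{millis}", "%Y-%m-%d %H:%M:%S,%f")
        if first is None:
            first = dt  # windows are anchored at the first timestamp
        # timedelta // timedelta gives the whole number of windows elapsed
        start = first + ((dt - first) // window) * window
        if gender == "m":
            counts[start, dept, building] += 1
    return counts


counts = count_males_per_window(io.StringIO(SAMPLE))
for (start, dept, building), n in sorted(counts.items()):
    stamp = "{},{:03}".format(start.strftime("%Y-%m-%d %H:%M:%S"), start.microsecond // 1000)
    print("{},{},{},{}".format(stamp, dept, building, n))
```

With this approach the 17:46:14,127 marketing,c row is counted in the window starting at 2017-09-07 17:46:14,018, and empty windows simply never appear in the counter. To read from a file instead, pass the open file object in place of `io.StringIO(SAMPLE)`.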