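I'm trying to come up with a rough estimate of workload versus the work staff actually complete in a month.
I have a csv that looks roughly like this (though the real one is bigger):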
+--------+-------+---------------+
| Date | Name | Units of Work |
+--------+-------+---------------+
| 1/1/17 | Bob | 450 |
| 2/1/17 | Alice | 300 |
| 2/1/17 | Bob | 450 |
| 2/1/17 | Larry | 50 |
| 3/1/17 | Alice | 400 |
| 3/1/17 | Bob | 11 |
| 3/1/17 | Larry | 100 |
| 4/1/17 | Alice | 1000 |
| 4/1/17 | Bob | 240 |
| 4/1/17 | Larry | 33 |
+--------+-------+---------------+
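I want to:
I can't even get requirement 1 working, let alone 2 through 5. I get one file per date. And instead of a column containing the median, I get a new column named 'NewColumn' filled with the word 'median', like this: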
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('date_trunc').assign(NewColumn='median')
df.median(axis=None, skipna=None, level=None, numeric_only=None)
for i, g in df.groupby('date_trunc'):
    g.to_csv('{}.csv'.format(i), header=True, index_label=False, index=False)
+---------+-------+---------------+-----------+
| Date | Name | Units of work | NewColumn |
+---------+-------+---------------+-----------+
| 12/1/16 | Alice | 6222 | median |
| 12/1/16 | Bob | 14530 | median |
| 12/1/16 | Larry | 16887 | median |
+---------+-------+---------------+-----------+
I know I'm probably doing a lot wrong here, but I'd really appreciate some guidance.
What I ultimately want is a single csv:
+---------+--------+
| Date | Median |
+---------+--------+
| 12/1/16 | 1110 |
| 1/1/17 | 1400 |
| 2/1/17 | 1200 |
+---------+--------+
Answer 0 (score: 0)
I'm about 80% sure I haven't fully understood the goal, but here's my attempt.
import pandas as pd
df = pd.DataFrame({"Date": ["Jan-12", "Jan-12"], "Name": ["Bob", "Alice"], "Work": [400, 300]})
def extract_rows_with_date(df, date):
    return df[df["Date"] == date]
# Extract unique dates
dates = df.Date.unique()
# Creating an empty dataframe dictionary (you get it)
new_df = {"Date": [], "Median": []}
for date in dates:
    # Fun stuff here
    date_df = extract_rows_with_date(df, date)
    median = date_df["Work"].median()
    above_20_median = date_df[date_df["Work"] > (median*20)/100]
    count_above_median = above_20_median.shape[0]
    new_df["Date"].append(date)
    new_df["Median"].append(count_above_median * median)
new_df = pd.DataFrame(new_df)
print(new_df.head())
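For comparison, the same per-date calculation can probably be written with groupby instead of an explicit loop. This is only a sketch: it assumes the column names used in this answer (Date, Name, Work), borrows a few rows from the question's table as sample data, and the helper name weighted_median is made up for illustration.

```python
import pandas as pd

# Sample rows borrowed from the question's table; column names follow this answer.
df = pd.DataFrame({
    "Date": ["2/1/17", "2/1/17", "2/1/17", "3/1/17", "3/1/17", "3/1/17"],
    "Name": ["Alice", "Bob", "Larry", "Alice", "Bob", "Larry"],
    "Work": [300, 450, 50, 400, 11, 100],
})

def weighted_median(work):
    """Hypothetical helper: one date's median times the number of rows above 20% of it."""
    median = work.median()
    count_above = (work > 0.2 * median).sum()
    return count_above * median

# sort=False keeps the dates in order of first appearance, like df.Date.unique()
result = (
    df.groupby("Date", sort=False)["Work"]
      .apply(weighted_median)
      .rename("Median")
      .reset_index()
)
print(result)
```

The groupby version does the same filtering and multiplication per date, but leaves the splitting and reassembly to pandas.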
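Answer 1 (score: 0)
I hope the following steps get you closer to the CSV output you want.
First, for anyone else who wants to copy and paste into pd.read_clipboard(), here is a concise reproduction of the input DataFrame: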
Date Name Units of Work
0 Jan-17 Bob 450.0
1 Feb-17 Alice 300.0
2 Feb-17 Bob 450.0
3 Feb-17 Larry 50.0
4 Mar-17 Alice 400.0
5 Mar-17 Bob 11.0
6 Mar-17 Larry 100.0
7 Apr-17 Alice 1000.0
8 Apr-17 Bob 240.0
9 Apr-17 Larry 33.0
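If the clipboard isn't an option (for example on a headless machine), the same frame can be rebuilt from text. A minimal sketch, assuming the rows above re-typed as comma-separated text; the variable name csv_text is my own:

```python
import io
import pandas as pd

# The table above, re-encoded as comma-separated text (an assumption, not the original file)
csv_text = """Date,Name,Units of Work
Jan-17,Bob,450
Feb-17,Alice,300
Feb-17,Bob,450
Feb-17,Larry,50
Mar-17,Alice,400
Mar-17,Bob,11
Mar-17,Larry,100
Apr-17,Alice,1000
Apr-17,Bob,240
Apr-17,Larry,33
"""

# read_csv accepts any file-like object, so StringIO stands in for a real file
df = pd.read_csv(io.StringIO(csv_text))
# Match the float dtype shown in the printout above
df["Units of Work"] = df["Units of Work"].astype(float)
print(df)
```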
# Docs on Python datetime format strings: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: x.strip()), format='%b-%y')
meds = df.groupby('Date')[['Units of Work']].median()
meds
Units of Work
Date
2017-01-01 450.0
2017-02-01 300.0
2017-03-01 100.0
2017-04-01 240.0
# Set an index on which to merge the medians
df2 = df.set_index('Date')
# Assignment aligns on the index: each of the 10 rows in df2 picks up the median for its date from the 4-row meds DataFrame
df2['Median'] = meds
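As an aside, the per-row median can also be obtained without setting an index, using groupby(...).transform('median'). This is a sketch under the same column names as above, not necessarily the better approach; the frame is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["Jan-17", "Feb-17", "Feb-17", "Feb-17", "Mar-17",
             "Mar-17", "Mar-17", "Apr-17", "Apr-17", "Apr-17"],
    "Name": ["Bob", "Alice", "Bob", "Larry", "Alice",
             "Bob", "Larry", "Alice", "Bob", "Larry"],
    "Units of Work": [450.0, 300.0, 450.0, 50.0, 400.0,
                      11.0, 100.0, 1000.0, 240.0, 33.0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%b-%y")

# transform returns one value per original row (the median of that row's date),
# so the result lines up with df without any index juggling
df["Median"] = df.groupby("Date")["Units of Work"].transform("median")
print(df)
```

Either way, every row ends up carrying the median for its own date, which is what the mask in the next step relies on.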
# Build a boolean mask to pick out "hard workers" and "slackers"
mask = df2['Units of Work'] >= 0.2 * df2['Median']
# "Hard workers," where units of work >= 20% of that date's median
df2[mask]
Name Units of Work Median
Date
2017-01-01 Bob 450.0 450.0
2017-02-01 Alice 300.0 300.0
2017-02-01 Bob 450.0 300.0
2017-03-01 Alice 400.0 100.0
2017-03-01 Larry 100.0 100.0
2017-04-01 Alice 1000.0 240.0
2017-04-01 Bob 240.0 240.0
# Bonus: "slackers," where units of work < 20% of that date's median
df2[~mask]
Name Units of Work Median
Date
2017-02-01 Larry 50.0 300.0
2017-03-01 Bob 11.0 100.0
2017-04-01 Larry 33.0 240.0
df2[mask].groupby('Date').size().mul(meds['Units of Work'])
2017-01-01 450.0
2017-02-01 600.0
2017-03-01 200.0
2017-04-01 480.0
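To get from that Series to the single two-column CSV the question asks for, something like the following should work. It is a sketch under my reading of the goal (rows at or above 20% of the date's median are counted, and the count is multiplied by that median); the names out and csv_text and the file name medians.csv are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["Jan-17", "Feb-17", "Feb-17", "Feb-17", "Mar-17",
             "Mar-17", "Mar-17", "Apr-17", "Apr-17", "Apr-17"],
    "Name": ["Bob", "Alice", "Bob", "Larry", "Alice",
             "Bob", "Larry", "Alice", "Bob", "Larry"],
    "Units of Work": [450.0, 300.0, 450.0, 50.0, 400.0,
                      11.0, 100.0, 1000.0, 240.0, 33.0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%b-%y")

# One median per date
meds = df.groupby("Date")["Units of Work"].median()
# Look up each row's date in meds to get that row's median
row_median = df["Date"].map(meds)
# Keep rows at or above 20% of their date's median
kept = df[df["Units of Work"] >= 0.2 * row_median]

# Count of kept rows per date, times that date's median
out = kept.groupby("Date").size().mul(meds).rename("Median").reset_index()

# With no path, to_csv returns the CSV text;
# out.to_csv("medians.csv", index=False) would write a file instead
csv_text = out.to_csv(index=False)
print(csv_text)
```

The reset_index call is what turns the Date index into an ordinary column, so the CSV gets exactly the Date and Median headers.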