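Split a DataFrame by date and compute the median of all rows for each date

Time: 2017-12-08 03:20:24

Tags: python pandas csv date median

I am trying to get a rough estimate of how much work, as opposed to how many workers, was completed in a given month.

I have a CSV that looks roughly like this (though the real one is much larger):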

+--------+-------+---------------+
|  Date  | Name  | Units of Work |
+--------+-------+---------------+
| 1/1/17 | Bob   |           450 |
| 2/1/17 | Alice |           300 |
| 2/1/17 | Bob   |           450 |
| 2/1/17 | Larry |            50 |
| 3/1/17 | Alice |           400 |
| 3/1/17 | Bob   |            11 |
| 3/1/17 | Larry |           100 |
| 4/1/17 | Alice |          1000 |
| 4/1/17 | Bob   |           240 |
| 4/1/17 | Larry |            33 |
+--------+-------+---------------+

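I would like to:

  1. Calculate the median 'Units of Work' for each 'Date'
  2. Determine whether a 'Name' has less than 20% of the median 'Units of Work' for that 'Date'
  3. If a 'Name' has less than 20% of the median, drop it
  4. Multiply the count of 'Name's left for the 'Date' by the median 'Units of Work' for that 'Date'
  5. Output a new CSV in which each 'Date' appears only once, on its own row, with that date's median 'Units of Work' multiplied by the number of remaining 'Name's for that 'Date'

I can't even get requirement 1 working, let alone 2 through 5. I get one file per date. Instead of a column holding the median, I get a new column called 'NewColumn' filled with the word 'median', like so: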

    # -*- coding: utf-8 -*-
    import pandas as pd
    df = pd.read_csv('source.csv')
    df = df.sort_values('date_trunc').assign(NewColumn='median')
    df.median(axis=None, skipna=None, level=None, numeric_only=None)
    for i, g in df.groupby('date_trunc'):
        g.to_csv('{}.csv'.format(i), header=True, index_label=False, index=False)

+---------+-------+---------------+-----------+
|  Date   | Name  | Units of work | NewColumn |
+---------+-------+---------------+-----------+
| 12/1/16 | Alice |          6222 | median    |
| 12/1/16 | Bob   |         14530 | median    |
| 12/1/16 | Larry |         16887 | median    |
+---------+-------+---------------+-----------+

I know I am probably doing a lot wrong here, but I would really appreciate some guidance.

What I ultimately want is a single CSV:

+---------+--------+
|  Date   | Median |
+---------+--------+
| 12/1/16 |   1110 |
| 1/1/17  |   1400 |
| 2/1/17  |   1200 |
+---------+--------+

2 answers:

Answer 0: (score: 0)

I am about 80% sure I have not fully understood the goal, but here is my attempt.

import pandas as pd

df = pd.DataFrame({"Date": ["Jan-12", "Jan-12"], "Name": ["Bob", "Alice"], "Work": [400, 300]})

def extract_rows_with_date(df, date):
    return df[df["Date"] == date]

# Extract unique dates
dates = df.Date.unique()

# Accumulate the output columns in a dict of lists, converted to a DataFrame at the end
new_df = {"Date": [], "Median": []}

for date in dates:
    # Median of the work values for this date only
    date_df = extract_rows_with_date(df, date)
    median = date_df["Work"].median()

    # Keep rows at or above 20% of the median (the question drops only rows below it)
    above_20_median = date_df[date_df["Work"] >= (median*20)/100]

    count_above_median = above_20_median.shape[0]

    new_df["Date"].append(date)
    new_df["Median"].append(count_above_median * median)


new_df = pd.DataFrame(new_df)
print(new_df.head())
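
The same per-date logic can probably be written without an explicit loop. The following is only a sketch under assumptions: it reuses this answer's "Work" column name, takes a few rows from the question's table as sample data, and introduces a helper called scaled_median that is not part of pandas or of the original answer.

```python
import pandas as pd

# Sample rows taken from the question's table (February and March 2017 only).
df = pd.DataFrame({
    "Date": ["Feb-17", "Feb-17", "Feb-17", "Mar-17", "Mar-17", "Mar-17"],
    "Name": ["Alice", "Bob", "Larry", "Alice", "Bob", "Larry"],
    "Work": [300, 450, 50, 400, 11, 100],
})

def scaled_median(work):
    """Hypothetical helper: one date's median times the count of values >= 20% of it."""
    median = work.median()
    kept = (work >= 0.2 * median).sum()
    return kept * median

# sort=False keeps the dates in their order of first appearance.
result = (
    df.groupby("Date", sort=False)["Work"]
      .apply(scaled_median)
      .reset_index(name="Median")
)
print(result)
```

Each group is handled independently, so the median is always taken over all rows for that date, including the ones that end up below the 20% threshold.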

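Answer 1: (score: 0)

I hope the following steps get you closer to your desired CSV output.

First, for anyone else who wants to copy and paste into pd.read_clipboard(), here is a concise reproduction of the input DataFrame: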

     Date     Name     Units of Work
0   Jan-17    Bob               450.0
1   Feb-17    Alice             300.0
2   Feb-17    Bob               450.0
3   Feb-17    Larry              50.0
4   Mar-17    Alice             400.0
5   Mar-17    Bob                11.0
6   Mar-17    Larry             100.0
7   Apr-17    Alice            1000.0
8   Apr-17    Bob               240.0
9   Apr-17    Larry              33.0

0. Convert the dates to Python datetimes (for a sensible sort order)

# Docs on Python datetime format strings: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: x.strip()), format='%b-%y')
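
The format string above matches the 'Jan-17' style used in this answer's reproduction. The CSV in the question shows dates such as 1/1/17 instead. Assuming those are month/day/two-digit-year (an assumption, since the question does not say), a sketch of the equivalent conversion might look like this:

```python
import pandas as pd

# Assumption: the CSV stores dates as month/day/two-digit-year, e.g. '2/1/17'.
raw = pd.Series(["1/1/17", "2/1/17", "12/1/16"])

# .str.strip() removes stray whitespace before parsing.
parsed = pd.to_datetime(raw.str.strip(), format="%m/%d/%y")

# Once parsed, December 2016 sorts before January 2017, unlike the raw strings.
print(parsed.sort_values())
```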

1. For each date, find the median units of work

meds = df.groupby('Date')[['Units of Work']].median()
meds
    Units of Work
Date    
2017-01-01  450.0
2017-02-01  300.0
2017-03-01  100.0
2017-04-01  240.0

2, 3. Drop rows where units of work < 20% of that date's median units of work

# Set an index on which to merge the medians
df2 = df.set_index('Date')
# Pandas is smart enough to merge the 4-row meds DataFrame onto the 10-row df2 DataFrame based on matching index values
df2['Median'] = meds 

# Build a boolean mask to pick out "hard workers" and "slackers"
mask = df2['Units of Work'] >= 0.2 * df2['Median']

# "Hard workers," where units of work >= 20% of that date's median
df2[mask]
               Name  Units of Work  Median
Date                                      
2017-01-01   Bob             450.0   450.0
2017-02-01   Alice           300.0   300.0
2017-02-01   Bob             450.0   300.0
2017-03-01   Alice           400.0   100.0
2017-03-01   Larry           100.0   100.0
2017-04-01   Alice          1000.0   240.0
2017-04-01   Bob             240.0   240.0

# Bonus: "slackers," where units of work < 20% of that date's median
df2[~mask]
               Name  Units of Work  Median
Date                                      
2017-02-01   Larry            50.0   300.0
2017-03-01   Bob              11.0   100.0
2017-04-01   Larry            33.0   240.0
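
As an alternative to setting the index and assigning meds, groupby(...).transform('median') returns one median per original row, so no index alignment is needed. This is a sketch on a subset of the sample data, not the approach the answer itself uses:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-02-01"] * 3 + ["2017-03-01"] * 3),
    "Name": ["Alice", "Bob", "Larry"] * 2,
    "Units of Work": [300.0, 450.0, 50.0, 400.0, 11.0, 100.0],
})

# transform('median') broadcasts each date's median back onto every row of that date.
df["Median"] = df.groupby("Date")["Units of Work"].transform("median")

# Same boolean mask as above: keep rows at or above 20% of their date's median.
mask = df["Units of Work"] >= 0.2 * df["Median"]
hard_workers = df[mask]
print(hard_workers)
```

Because the result shares df's index, 'Date' can stay an ordinary column for later steps.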

4. For each date, multiply the number of "hard workers" by the median units of work

df2[mask].groupby('Date').size().mul(meds['Units of Work'])
2017-01-01    450.0
2017-02-01    600.0
2017-03-01    200.0
2017-04-01    480.0
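
The question's last requirement, a single CSV with one row per date, is not covered by the steps above. A minimal sketch of that final step follows. It assumes the output columns should be named 'Date' and 'Median' as in the question's desired table, and it writes to an in-memory buffer that stands in for a real file path (for example a hypothetical 'result.csv').

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-02-01", "2017-02-01"]),
    "Name": ["Bob", "Alice", "Bob", "Larry"],
    "Units of Work": [450.0, 300.0, 450.0, 50.0],
})

# Keep rows at or above 20% of their date's median.
row_median = df.groupby("Date")["Units of Work"].transform("median")
kept = df[df["Units of Work"] >= 0.2 * row_median]

# The per-date median is taken over all rows, including the dropped ones.
per_date_median = df.groupby("Date")["Units of Work"].median()

# Count of remaining names per date times that date's median, as a two-column frame.
out = (
    kept.groupby("Date").size().mul(per_date_median)
        .rename("Median")
        .rename_axis("Date")
        .reset_index()
)

buffer = io.StringIO()  # stand-in for a file path (hypothetical, e.g. 'result.csv')
out.to_csv(buffer, index=False, date_format="%m/%d/%y")
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer to to_csv would write the same content to disk.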