Pandas DataFrame groupby: determine the values in one group vs. another

Time: 2015-10-31 18:26:39

Tags: python pandas group-by

My dataframe looks like this:

Date        ID
2014-12-31  1
2014-12-31  2
2014-12-31  3
2014-12-31  4
2014-12-31  5
2014-12-31  6
2014-12-31  7
2015-01-01  1
2015-01-01  2
2015-01-01  3
2015-01-01  4
2015-01-01  5
2015-01-02  1
2015-01-02  3
2015-01-02  7
2015-01-02  9

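What I want to do is find the IDs that are exclusive to one date when compared with the IDs of another date.

Example 1: the result df would contain the IDs exclusive to 2014-12-31 vs. the IDs in 2015-01-01, and the IDs exclusive to 2015-01-01 vs. the IDs in 2015-01-02: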

   2015-01-01  6 
   2015-01-01  7
   2015-01-02  2
   2015-01-02  4
   2015-01-02  5

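I would like to "choose" how many days "back" to compare. For example, I could set a variable daysback=1 and each day would be compared with the previous day. Or I could set daysback=2 and each day would be compared with the day two days earlier, and so on.

Beyond df.groupby('Date'), I don't know how to go about this. Possibly using diff()?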

2 answers:

Answer 0: (score: 1)

I'm assuming that "Date" in your DataFrame is: 1) a date object and 2) not the index.

If those assumptions are wrong, that changes things a bit.
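If the column currently holds strings, something along these lines should bring it into the assumed shape. This is only a sketch: the three-row frame is made up for illustration, and it assumes the dates are ISO-formatted strings.

```python
import datetime
import pandas as pd

# Hypothetical miniature frame; the real data would come from your own source
df = pd.DataFrame({
    "Date": ["2014-12-31", "2015-01-01", "2015-01-02"],
    "ID": [1, 2, 3],
})

# Parse the strings, then keep only the calendar date as datetime.date objects
df["Date"] = pd.to_datetime(df["Date"]).dt.date

# If Date had been set as the index, df.reset_index() would turn it back into a column
print(type(df["Date"].iloc[0]))
```

With the column converted this way, comparisons such as df['Date'] == datetime.date(2015, 1, 1) behave as the function below expects.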

import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):

    # The date of interest and the date `daysback` days before it
    date_new = date
    date_old = date - timedelta(days=daysback)

    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']

    # Select by index label the rows whose ID appears on date_new but not on date_old
    return df.loc[ids_new[~ids_new.isin(ids_old)].index]

date = datetime.date(2015, 1, 2)
daysback = 1

print(find_unique_ids(df, date, daysback))

Running it produces the following output:

          Date  ID
14  2015-01-02   7
15  2015-01-02   9

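If the date *is* your index field, then you need to modify the two lookup lines in the function, and return the filtered IDs directly:

ids_new = df.loc[date_new]['ID']
ids_old = df.loc[date_old]['ID']

return ids_new[~ids_new.isin(ids_old)].to_frame()

Output:

            ID
Date          
2015-01-02   7
2015-01-02   9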

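Edit:

This is a bit dirty, but it should accomplish what you want. I've added inline comments explaining what's going on. There is probably a cleaner, more efficient way to do this if it's something you run regularly or across large amounts of data.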

import pandas as pd
from datetime import timedelta

def find_unique_ids(df, daysback):

    # We need both Date and ID to both be either fields or index fields -- no mix/match.
    df = df.reset_index() 

    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))

    # Join df back on to itself, SQL style LEFT OUTER.
    df2 = pd.merge(df,df, left_on=['DateComp','ID'], right_on=['Date','ID'], how='left')

    # Create series of missing_id values from the right table
    missing_ids = (df2['Date_y'].isnull())

    # Create series of valid DateComp values. 
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value will be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())

    # Use those to find missing IDs and valid dates. Create a new output DataFrame.
    output = df2[(valid_dates) & (missing_ids)][['DateComp_x','ID']]

    # Rename columns of output and return
    output.columns = ['Date','ID']
    return output

Test output:

         Date  ID
5  2015-01-01   6
6  2015-01-01   7
8  2015-01-02   2
10 2015-01-02   4
11 2015-01-02   5
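For comparison, the same self-join idea can be sketched with merge's indicator parameter instead of checking for nulls in the suffixed columns. This is an illustrative variant, not the code above: the function name ids_missing_on_date is made up, and it assumes Date is a datetime64 column rather than date objects.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-12-31"] * 7 + ["2015-01-01"] * 5 + ["2015-01-02"] * 4
    ),
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})

def ids_missing_on_date(df, daysback=1):
    """Return (Date, ID) pairs where ID existed `daysback` days before Date but not on Date."""
    left = df[["Date", "ID"]].copy()
    # Shift every row forward to the date it should be compared against
    left["Date"] = left["Date"] + pd.Timedelta(days=daysback)
    # Drop comparison dates that have no rows at all in the data
    left = left[left["Date"].isin(df["Date"])]
    # LEFT join; the _merge column records whether a matching (Date, ID) pair exists
    merged = left.merge(
        df[["Date", "ID"]].drop_duplicates(),
        on=["Date", "ID"],
        how="left",
        indicator=True,
    )
    unmatched = merged["_merge"] == "left_only"
    return merged.loc[unmatched, ["Date", "ID"]].reset_index(drop=True)

result = ids_missing_on_date(df, daysback=1)
print(result)
```

On the sample data this yields the same five Date/ID pairs as the test output above, with a fresh 0-based index.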

Edit:

missing_ids=df2[df2['Date_y'].isnull()] #gives the whole necessary dataframe

Answer 1: (score: 1)

Another way is to apply list as the aggregation.

In action:

df
Out[146]: 
          Date  Unnamed: 2
0   2014-12-31           1
1   2014-12-31           2
2   2014-12-31           3
3   2014-12-31           4
4   2014-12-31           5
5   2014-12-31           6
6   2014-12-31           7
7   2015-01-01           1
8   2015-01-01           2
9   2015-01-01           3
10  2015-01-01           4
11  2015-01-01           5
12  2015-01-02           1
13  2015-01-02           3
14  2015-01-02           7
15  2015-01-02           9

abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)

abbs
Out[142]: 
Date
2014-12-31    [1, 2, 3, 4, 5, 6, 7]
2015-01-01          [1, 2, 3, 4, 5]
2015-01-02             [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object

abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]

list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]

You can write a function and use dates instead of str :)
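A minimal sketch of such a function, under a few assumptions: the ID column is named ID (as in the question, not "Unnamed: 2"), Date is a datetime64 column, and the name exclusive_ids is invented here. It groups into sets rather than lists so the difference can be taken directly.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2014-12-31"] * 7 + ["2015-01-01"] * 5 + ["2015-01-02"] * 4
    ),
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})

def exclusive_ids(df, date, daysback=1):
    """IDs present `daysback` days before `date` but absent on `date`."""
    # Map each date (as a Timestamp) to the set of IDs seen on it
    ids_by_date = df.groupby("Date")["ID"].apply(set).to_dict()
    date = pd.Timestamp(date)
    earlier = date - pd.Timedelta(days=daysback)
    # A date with no rows is treated as having an empty ID set
    return sorted(ids_by_date.get(earlier, set()) - ids_by_date.get(date, set()))

print(exclusive_ids(df, "2015-01-01"))              # [6, 7]
print(exclusive_ids(df, "2015-01-02"))              # [2, 4, 5]
print(exclusive_ids(df, "2015-01-02", daysback=2))  # [2, 4, 5, 6]
```

pd.Timestamp accepts strings, datetime.date and datetime.datetime values alike, so the caller can pass whichever is convenient.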