My DataFrame looks like this:
Date ID
2014-12-31 1
2014-12-31 2
2014-12-31 3
2014-12-31 4
2014-12-31 5
2014-12-31 6
2014-12-31 7
2015-01-01 1
2015-01-01 2
2015-01-01 3
2015-01-01 4
2015-01-01 5
2015-01-02 1
2015-01-02 3
2015-01-02 7
2015-01-02 9
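What I want to do is find the IDs that appear on one date but not on another date.
Example 1: the resulting df would contain the IDs that are in 2014-12-31 but not in 2015-01-01, followed by the IDs that are in 2015-01-01 but not in 2015-01-02: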
2015-01-01 6
2015-01-01 7
2015-01-02 2
2015-01-02 4
2015-01-02 5
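I would like to "choose" how many days "back" to compare. For example, I could set a variable daysback=1 and every date would be compared with the previous day, or set daysback=2 and every date would be compared with the day two days earlier, and so on.
Beyond df.groupby('Date'), I don't know how to approach this. Maybe with diff()?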
Answer 0 (score: 1)
I'm assuming that "Date" in your DataFrame is 1) a date object and 2) not the index.
If those assumptions are wrong, that changes things a bit.
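The two assumptions above can be reproduced with a small setup like the following. This is only a sketch: the `ids_by_date` dictionary is a hypothetical helper for rebuilding the question's sample data and does not appear in the original post.

```python
import datetime
import pandas as pd

# Hypothetical helper: the question's sample data, keyed by date string.
ids_by_date = {
    "2014-12-31": [1, 2, 3, 4, 5, 6, 7],
    "2015-01-01": [1, 2, 3, 4, 5],
    "2015-01-02": [1, 3, 7, 9],
}
rows = [(day, id_) for day, ids in ids_by_date.items() for id_ in ids]
df = pd.DataFrame(rows, columns=["Date", "ID"])

# Assumption 1: "Date" holds datetime.date objects, not strings.
df["Date"] = pd.to_datetime(df["Date"]).dt.date

# Assumption 2: "Date" stays an ordinary column; the index is the default one.
print(type(df["Date"].iloc[0]))
print(df.index.name)
```

If your dates are stored as strings or as pandas timestamps, a conversion like the `.dt.date` step is one way to match what the function below expects.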
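import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):
    # Returns the rows for `date` whose ID does not appear `daysback` days earlier.
    date_new = date
    date_old = date - timedelta(days=daysback)
    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']
    # `~` negates the boolean mask; the surviving index labels select rows of df.
    unique_ids = ids_new[~ids_new.isin(ids_old)]
    return df.loc[unique_ids.index]

date = datetime.date(2015, 1, 2)
daysback = 1
print(find_unique_ids(df, date, daysback))
Running it on the sample data (with the default integer index) produces the following output:
Date ID
14 2015-01-02 7
15 2015-01-02 9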
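If Date is your index field, the lookups have to go through the index instead, and the filtered IDs can be returned directly. Replace the two ID lookups and the return line in the function with:
    ids_new = df.loc[[date_new], 'ID']
    ids_old = df.loc[[date_old], 'ID']
    return ids_new[~ids_new.isin(ids_old)].to_frame()
Output:
ID
Date
2015-01-02 7
2015-01-02 9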
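Edit:
This is a bit dirty, but it should do what you want. I added inline comments explaining what is happening. If this is something you run regularly or over a large amount of data, there is probably a cleaner, more efficient approach.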
import pandas as pd
from datetime import timedelta

def find_unique_ids(df, daysback):
    # Date and ID must both be either columns or index levels -- no mix/match.
    df = df.reset_index()
    # Calculate DateComp by adding our daysback value as a timedelta.
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))
    # Join df back on to itself, SQL-style LEFT OUTER.
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')
    # Boolean series: True where the right table has no matching ID.
    missing_ids = df2['Date_y'].isnull()
    # Boolean series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value would be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())
    # Keep rows with a valid date and a missing ID. Create a new output DataFrame.
    output = df2.loc[valid_dates & missing_ids, ['DateComp_x', 'ID']]
    # Rename columns of output and return.
    output.columns = ['Date', 'ID']
    return output
Test output:
Date ID
5 2015-01-01 6
6 2015-01-01 7
8 2015-01-02 2
10 2015-01-02 4
11 2015-01-02 5
Edit:
missing_ids = df2[df2['Date_y'].isnull()]  # gives the whole required DataFrame
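One candidate for the cleaner approach mentioned above is an anti-join using the `indicator` option of `merge`. The sketch below is an assumption about how that could look, not the answerer's code: `exclusive_ids` is a hypothetical name, and the function converts `Date` with `pd.to_datetime` so that the date arithmetic is vectorized.

```python
import pandas as pd

def exclusive_ids(df, daysback=1):
    """Hypothetical helper: IDs on a date that are gone `daysback` days later.

    Rows are labelled with the later date, as in the question's example.
    """
    dates = pd.to_datetime(df["Date"])
    # Left side: every ID, shifted forward to the date it is compared against.
    left = pd.DataFrame({"Date": dates + pd.Timedelta(days=daysback), "ID": df["ID"]})
    # Right side: the IDs actually present on each date.
    right = pd.DataFrame({"Date": dates, "ID": df["ID"]})

    # Only compare against dates that exist in the data.
    left = left[left["Date"].isin(right["Date"])]

    # indicator=True adds a "_merge" column; "left_only" marks rows with no match.
    merged = left.merge(right, on=["Date", "ID"], how="left", indicator=True)
    out = merged.loc[merged["_merge"] == "left_only", ["Date", "ID"]]
    return out.reset_index(drop=True)

# Sample data from the question (dates given as strings here).
df = pd.DataFrame({
    "Date": ["2014-12-31"] * 7 + ["2015-01-01"] * 5 + ["2015-01-02"] * 4,
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})
print(exclusive_ids(df, daysback=1))  # IDs 6, 7 for 2015-01-01 and 2, 4, 5 for 2015-01-02
```

Compared with checking `Date_y` for nulls, the `_merge` column states the intent of the join directly, and no leftover `_x`/`_y` columns need renaming.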
Answer 1 (score: 1)
Another way is to apply list as the aggregation,
in a function:
df
Out[146]:
Date Unnamed: 2
0 2014-12-31 1
1 2014-12-31 2
2 2014-12-31 3
3 2014-12-31 4
4 2014-12-31 5
5 2014-12-31 6
6 2014-12-31 7
7 2015-01-01 1
8 2015-01-01 2
9 2015-01-01 3
10 2015-01-01 4
11 2015-01-01 5
12 2015-01-02 1
13 2015-01-02 3
14 2015-01-02 7
15 2015-01-02 9
abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
abbs
Out[142]:
Date
2014-12-31 [1, 2, 3, 4, 5, 6, 7]
2015-01-01 [1, 2, 3, 4, 5]
2015-01-02 [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object
abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]
list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]
You can write a function and use dates instead of str :)
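A function along those lines might look like the sketch below. It is an illustration under assumptions rather than the answerer's code: `exclusive_ids_by_sets` and its `id_col` parameter are hypothetical names, the `Date` column is assumed to hold `datetime.date` objects, and `daysback` is taken from the question.

```python
import datetime
import pandas as pd

def exclusive_ids_by_sets(df, daysback=1, id_col="ID"):
    """Hypothetical helper: set difference between each date and `daysback` days before."""
    # Collect the IDs of every date into a set, keyed by the date itself.
    ids_by_date = df.groupby("Date")[id_col].apply(set).to_dict()
    rows = []
    for day, current_ids in ids_by_date.items():
        earlier = day - datetime.timedelta(days=daysback)
        if earlier not in ids_by_date:
            continue  # nothing to compare against
        # IDs that were present on the earlier date but are missing on `day`.
        for missing in sorted(ids_by_date[earlier] - current_ids):
            rows.append((day, missing))
    return pd.DataFrame(rows, columns=["Date", id_col])

# Sample data from the question, with datetime.date objects in "Date".
d1, d2, d3 = (datetime.date(2014, 12, 31), datetime.date(2015, 1, 1),
              datetime.date(2015, 1, 2))
df = pd.DataFrame({
    "Date": [d1] * 7 + [d2] * 5 + [d3] * 4,
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 1, 3, 7, 9],
})
print(exclusive_ids_by_sets(df, daysback=1))
```

With the column layout shown in this answer, the call would pass `id_col="Unnamed: 2"` instead of the default.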