
Time: 2016-11-17 23:03:21

Tags: python pandas dataframe

This is the closest thing to what I'm looking for that I've found.

Let's say my dataframe looks like this:

import pandas as pd

d = {'item_number':['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID':['998798098','988797387','12398787','998798098','988797387'],
     'date':['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}

df = pd.DataFrame(data=d)

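I'd like to count the number of times the same item_number and Comp_ID are observed on consecutive days.

I imagine this will look something along the lines of: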

g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())

However, I need to extract the day from each date as an int before comparing, which I'm also having trouble with.

for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i,'date'],'%Y-%m-%d').day

Apparently location-based indexing only allows integers? How can I solve this?
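For reference, the two problems above can be sketched as follows. This is a minimal illustration on a cut-down frame, not part of the original post. The `.dt` accessor reads the day of month for the whole column at once, so no loop or `strptime` call is needed. The error arises because `.iloc` accepts only integer positions, whereas `.loc` accepts labels such as column names.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2016-11-12', '2016-11-13', '2016-11-17']})

# Parse the strings once, then read the day of month through the .dt accessor.
df['date'] = pd.to_datetime(df['date'])
df['day_column'] = df['date'].dt.day

# .iloc needs integer positions for both row and column ...
first_by_position = df.iloc[0, df.columns.get_loc('date')]
# ... while .loc takes labels (row label 0, column label 'date').
first_by_label = df.loc[0, 'date']
```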

2 Answers:

Answer 0: (score: 0)

One solution is to use a pivot table to count the number of times a Comp_ID and item_number are observed on consecutive days.

import pandas as pd

d = {'item_number':['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID':['998798098','988797387','12398787','998798098','988797387'],
     'date':['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}

df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'])
# Sort so rows of the same pair are adjacent and in date order
df = df.sort_values(['item_number','Comp_ID','date'])
df['delta'] = (df['date'] - df['date'].shift(1))
# Keep rows exactly one day after the previous row of the same pair, then count them
df = df[(df['delta'] == pd.Timedelta(days=1)) & (df['Comp_ID'] == df['Comp_ID'].shift(1)) &
        (df['item_number'] == df['item_number'].shift(1))].pivot_table(index=['item_number','Comp_ID'],
           values=['date'],aggfunc='count').reset_index()
df.rename(columns={'date':'consecutive_days'},inplace=True)

Result

  item_number    Comp_ID  consecutive_days
0   AKD098008  988797387                 1
1      K208UL  998798098                 1 

Answer 1: (score: 0)

  

  However, I need to extract the day from each date as an int before comparing, which I'm also having trouble with.

Why?
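One possible reason to avoid day-of-month integers, offered here as an illustration rather than as the answerer's stated argument: the integers wrap around at month boundaries, while subtracting the timestamps themselves does not. The two dates below are made up for the example.

```python
import pandas as pd

# Two consecutive calendar days that straddle a month boundary (hypothetical data).
dates = pd.Series(pd.to_datetime(['2016-11-30', '2016-12-01']))

# Comparing day-of-month integers: 1 - 30, which does not look like "one day later".
day_gap = dates.dt.day.diff().iloc[1]

# Subtracting the timestamps directly gives the real gap of one day.
true_gap = dates.diff().iloc[1]
```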

To fix your code, you need:

consecutive['date'] = pd.to_datetime(consecutive['date'])
g = consecutive.groupby(['Comp_ID','item_number'])
g['date'].apply(lambda x: sum(abs((x.shift(-1) - x)) == pd.to_timedelta(1, unit='D')))

Note the following:

  1. The code above avoids repetition. This is a basic programming principle: Don't Repeat Yourself.
  2. It converts 1 to a timedelta for a proper comparison.
  3. It takes the absolute difference.
  4. Tip: write a top-level function for your work instead of a lambda, since it gives better readability, brevity, and aesthetics:

    def differencer(grp, day_dif):
        """Counts rows in grp separated by day_dif day(s)"""
        d = abs(grp.shift(-1) - grp)
        return sum(d == pd.to_timedelta(day_dif, unit='D'))
    g['date'].apply(differencer, day_dif=1)

    Explanation:

    It is pretty straightforward. The dates are converted to Timestamp type and then subtracted. The difference is a timedelta, which must in turn be compared with a timedelta object, hence the conversion of 1 (or day_dif) to a timedelta. The result of that comparison is a Boolean Series. Booleans are represented as 0 for False and 1 for True, so the sum of a Boolean Series returns the total number of True values in the Series.
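The approach can be sketched end to end on the sample data from the question. This is a self-contained variant under stated assumptions, not the answerer's exact code: the helper name `count_consecutive` is hypothetical, and it sorts each group and uses `diff()` instead of `shift(-1)` with `abs`, so that out-of-order rows within a group are still handled.

```python
import pandas as pd

def count_consecutive(dates, day_dif=1):
    """Count adjacent dates in a group that are exactly day_dif day(s) apart.

    Hypothetical helper: sorts first, so the gaps are never negative.
    """
    gaps = dates.sort_values().diff()
    return int((gaps == pd.Timedelta(days=day_dif)).sum())

df = pd.DataFrame({
    'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
    'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
    'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14'],
})
df['date'] = pd.to_datetime(df['date'])

# One count per (Comp_ID, item_number) pair, indexed by that pair.
counts = df.groupby(['Comp_ID', 'item_number'])['date'].apply(count_consecutive)
```

Unlike the pivot-table answer, pairs with no consecutive days (here `DF900A`) stay in the result with a count of 0.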