I have a pandas DataFrame with a datetime index, as shown below:
df =
Fruit Quantity
01/02/10 Apple 4
01/02/10 Apple 6
01/02/10 Pear 7
01/02/10 Grape 8
01/02/10 Grape 5
02/02/10 Apple 2
02/02/10 Fruit 6
02/02/10 Pear 8
02/02/10 Pear 5
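Now, for each date and each fruit, I want to keep only one value (preferably the first one) and set the remaining rows for that fruit on that date to zero. So the desired output is: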
Fruit Quantity
01/02/10 Apple 4
01/02/10 Apple 0
01/02/10 Pear 7
01/02/10 Grape 8
01/02/10 Grape 0
02/02/10 Apple 2
02/02/10 Fruit 6
02/02/10 Pear 8
02/02/10 Pear 0
This is only a small example; my main DataFrame has over 3 million rows, and the fruits are not necessarily in order for each date.
Answer 0: (score: 1)
You can use:
m = df.rename_axis('Date').groupby(['Date', 'Fruit']).cumcount().eq(0)
df['Quantity'] = df['Quantity'].where(m, 0)
print (df)
Fruit Quantity
01/02/10 Apple 4
01/02/10 Apple 0
01/02/10 Pear 7
01/02/10 Grape 8
01/02/10 Grape 0
02/02/10 Apple 2
02/02/10 Fruit 6
02/02/10 Pear 8
02/02/10 Pear 0
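To make the mechanics easier to follow, here is a minimal self-contained sketch that rebuilds the sample frame from the question and exposes the intermediate cumcount values. The index is kept as plain strings here, which is an assumption; the real data may use an actual DatetimeIndex, and the names counter and result are only for illustration.

```python
import pandas as pd

# Sample data from the question; string dates as the index are an assumption.
dates = ['01/02/10'] * 5 + ['02/02/10'] * 4
df = pd.DataFrame(
    {'Fruit': ['Apple', 'Apple', 'Pear', 'Grape', 'Grape',
               'Apple', 'Fruit', 'Pear', 'Pear'],
     'Quantity': [4, 6, 7, 8, 5, 2, 6, 8, 5]},
    index=dates)

# cumcount numbers the rows inside each (Date, Fruit) group: 0, 1, 2, ...
counter = df.rename_axis('Date').groupby(['Date', 'Fruit']).cumcount()

# Only the first row of each group has counter 0, so the mask is True there.
m = counter.eq(0)

# where keeps Quantity where the mask is True and writes 0 everywhere else.
result = df['Quantity'].where(m, 0)
print(counter.tolist())  # [0, 1, 0, 0, 1, 0, 0, 0, 1]
print(result.tolist())   # [4, 0, 7, 8, 0, 2, 6, 8, 0]
```

Because cumcount follows the row order within each group, the row that survives is whichever one appears first in the frame for that date and fruit.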
Another solution uses reset_index, but the boolean mask must then be converted to a numpy array with values, because its index differs from the index of df:
m = df.reset_index().groupby(['index', 'Fruit']).cumcount().eq(0)
df['Quantity'] = df['Quantity'].where(m.values, 0)
print (df)
Fruit Quantity
01/02/10 Apple 4
01/02/10 Apple 0
01/02/10 Pear 7
01/02/10 Grape 8
01/02/10 Grape 0
02/02/10 Apple 2
02/02/10 Fruit 6
02/02/10 Pear 8
02/02/10 Pear 0
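The need for values can be checked directly. The sketch below (same sample data as the question, with string dates assumed for the index) shows that the mask produced after reset_index carries positional labels rather than dates, so it is passed as a bare array and applied by position.

```python
import pandas as pd

# Sample data from the question; string dates as the index are an assumption.
dates = ['01/02/10'] * 5 + ['02/02/10'] * 4
df = pd.DataFrame(
    {'Fruit': ['Apple', 'Apple', 'Pear', 'Grape', 'Grape',
               'Apple', 'Fruit', 'Pear', 'Pear'],
     'Quantity': [4, 6, 7, 8, 5, 2, 6, 8, 5]},
    index=dates)

# reset_index moves the unnamed index into a column called 'index'.
m = df.reset_index().groupby(['index', 'Fruit']).cumcount().eq(0)

# The mask is labelled 0..8, while df is labelled by date strings.
same_labels = m.index.equals(df.index)
print(same_labels)  # False

# .values drops the labels, so where applies the mask positionally.
df['Quantity'] = df['Quantity'].where(m.values, 0)
print(df['Quantity'].tolist())  # [4, 0, 7, 8, 0, 2, 6, 8, 0]
```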
Timings:
import numpy as np
import pandas as pd

np.random.seed(1235)
N = 10000
L = ['Apple','Pear','Grape','Fruit']
# 20 rows per date; integer division so that periods is an int
idx = np.repeat(pd.date_range('2017-10-01', periods=N//20).strftime('%d/%m/%y'), 20)
df = (pd.DataFrame({'Fruit': np.random.choice(L, N),
                    'Quantity':np.random.randint(100, size=N), 'idx':idx})
        .sort_values(['Fruit','idx'])
        .set_index('idx')
        .rename_axis(None))
#print (df)

def jez1(df):
    m = df.rename_axis('Date').groupby(['Date', 'Fruit']).cumcount().eq(0)
    df['Quantity'] = df['Quantity'].where(m, 0)
    return df

def jez2(df):
    m = df.reset_index().groupby(['index', 'Fruit']).cumcount().eq(0)
    df['Quantity'] = df['Quantity'].where(m.values, 0)
    return df

def rnso(df):
    df['date_fruit'] = df.index+df.Fruit # new column with date and fruit merged
    dflist = pd.unique(df.date_fruit) # find its unique values
    dfv = df.values # get rows as a 2-D numpy array
    for i in dflist: # for each unique date-fruit combination
        done = False
        for c in range(len(dfv)):
            if dfv[c][2] == i: # check each row
                if done:
                    dfv[c][1] = 0 # if not first, make quantity as 0
                else:
                    done = True
    # create new dataframe with new data:
    newdf = pd.DataFrame(data=dfv, columns=df.columns, index=df.index)
    return newdf.iloc[:,:2]

print (jez1(df))
print (jez2(df))
print (rnso(df))
In [189]: %timeit (rnso(df))
1 loop, best of 3: 6.27 s per loop
In [190]: %timeit (jez1(df))
100 loops, best of 3: 7.56 ms per loop
In [191]: %timeit (jez2(df))
100 loops, best of 3: 8.77 ms per loop
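EDIT, prompted by another answer:
You need to flag duplicates by the Fruit column and the index.
So there are two possible solutions:
1. reset_index to create a column from the index, call DataFrame.duplicated, and finally convert the output to a numpy array with values
2. set_index with append=True to add the Fruit column to the index, then call Index.duplicated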
#solution1
mask = df.reset_index().duplicated(['index','Fruit']).values
#solution2
#mask = df.set_index('Fruit', append=True).index.duplicated()
df.loc[mask, 'Quantity'] = 0
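As a quick sanity check, the following sketch runs both variants on the sample data from the question (string dates assumed for the index) and confirms that they yield the same mask. The names mask1 and mask2 are only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; string dates as the index are an assumption.
dates = ['01/02/10'] * 5 + ['02/02/10'] * 4
df = pd.DataFrame(
    {'Fruit': ['Apple', 'Apple', 'Pear', 'Grape', 'Grape',
               'Apple', 'Fruit', 'Pear', 'Pear'],
     'Quantity': [4, 6, 7, 8, 5, 2, 6, 8, 5]},
    index=dates)

# Solution 1: turn the index into a column, then flag repeated (index, Fruit) pairs.
mask1 = df.reset_index().duplicated(['index', 'Fruit']).values

# Solution 2: append Fruit to the index and flag repeated index entries.
mask2 = df.set_index('Fruit', append=True).index.duplicated()

# Both default to keep='first', so only later repeats are marked True.
print(np.array_equal(mask1, mask2))  # True

# A boolean array selects rows by position, so the duplicated dates are no problem.
df.loc[mask1, 'Quantity'] = 0
print(df['Quantity'].tolist())  # [4, 0, 7, 8, 0, 2, 6, 8, 0]
```

Unlike the cumcount version, the mask here is True for the rows to overwrite, which is why it is used with loc assignment rather than where.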
Timings1:
def jez1(df):
    m = df.rename_axis('Date').groupby(['Date', 'Fruit']).cumcount().eq(0)
    df['Quantity'] = df['Quantity'].where(m, 0)
    return df

def jez3(df):
    mask = df.reset_index().duplicated(['index','Fruit']).values
    df.loc[mask, 'Quantity'] = 0
    return df

def jez4(df):
    mask = df.set_index('Fruit', append=True).index.duplicated()
    df.loc[mask, 'Quantity'] = 0
    return df

print (jez1(df))
print (jez3(df))
print (jez4(df))
In [268]: %timeit jez1(df)
100 loops, best of 3: 6.37 ms per loop
In [269]: %timeit jez3(df)
100 loops, best of 3: 3.82 ms per loop
In [270]: %timeit jez4(df)
100 loops, best of 3: 4.21 ms per loop
Answer 1: (score: 0)
You can merge the date from the index with Fruit into a single key and use a for loop to set the remaining quantity values to 0:
df['date_fruit'] = df.index+df.Fruit # new column with date and fruit merged
dflist = pd.unique(df.date_fruit) # find its unique values
dfv = df.values # get rows as a 2-D numpy array
for i in dflist: # for each unique date-fruit combination
    done = False
    for c in range(len(dfv)):
        if dfv[c][2] == i: # check each row
            if done:
                dfv[c][1] = 0 # if not first, make quantity as 0
            else:
                done = True
# create new dataframe with new data:
newdf = pd.DataFrame(data=dfv, columns=df.columns, index=df.index)
newdf = newdf.iloc[:,:2] # remove merged date-fruit column
print(newdf)
Output:
Fruit Quantity
01/02/10 Apple 4
01/02/10 Apple 0
01/02/10 Pear 7
01/02/10 Grape 8
01/02/10 Grape 0
02/02/10 Apple 2
02/02/10 Fruit 6
02/02/10 Pear 8
02/02/10 Pear 0
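The nested loop above rescans every row once per unique date-fruit key, which is consistent with the multi-second timing reported for it in the other answer. If a plain Python loop is still preferred, a single pass with a set of already seen keys gives the same result. This is a swapped-in variant rather than this answer's code, and the names seen and quantities are hypothetical.

```python
import pandas as pd

# Sample data from the question; string dates as the index are an assumption.
dates = ['01/02/10'] * 5 + ['02/02/10'] * 4
df = pd.DataFrame(
    {'Fruit': ['Apple', 'Apple', 'Pear', 'Grape', 'Grape',
               'Apple', 'Fruit', 'Pear', 'Pear'],
     'Quantity': [4, 6, 7, 8, 5, 2, 6, 8, 5]},
    index=dates)

seen = set()       # (date, fruit) pairs encountered so far
quantities = []    # new Quantity column, built row by row

for date, fruit, qty in zip(df.index, df['Fruit'], df['Quantity']):
    key = (date, fruit)
    if key in seen:
        quantities.append(0)    # not the first row for this pair
    else:
        seen.add(key)
        quantities.append(qty)  # first row for this pair keeps its value

# assign returns a new frame and leaves df untouched.
newdf = df.assign(Quantity=quantities)
print(newdf['Quantity'].tolist())  # [4, 0, 7, 8, 0, 2, 6, 8, 0]
```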
Answer 2: (score: 0)
I read the OP's question, and I don't think pd.groupby() is needed to locate the duplicated fruits for each day.
Because only pd.Series.duplicated() is used, this is much faster.
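The snippet that belonged to this answer did not survive in this copy. The following is one hypothetical way to apply pd.Series.duplicated() to the problem, not the answerer's original code: it builds a Series of (date, fruit) tuples (the name key is an assumption) and zeroes every repeated entry. No timing is claimed for this sketch.

```python
import pandas as pd

# Sample data from the question; string dates as the index are an assumption.
dates = ['01/02/10'] * 5 + ['02/02/10'] * 4
df = pd.DataFrame(
    {'Fruit': ['Apple', 'Apple', 'Pear', 'Grape', 'Grape',
               'Apple', 'Fruit', 'Pear', 'Pear'],
     'Quantity': [4, 6, 7, 8, 5, 2, 6, 8, 5]},
    index=dates)

# One (date, fruit) tuple per row, in row order.
key = pd.Series(list(zip(df.index, df['Fruit'])))

# duplicated() marks every occurrence after the first as True.
mask = key.duplicated().to_numpy()

# Boolean array indexing is positional, so the repeated dates are fine.
df.loc[mask, 'Quantity'] = 0
print(df['Quantity'].tolist())  # [4, 0, 7, 8, 0, 2, 6, 8, 0]
```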