As the title says: can pandas count unique values in a string-type column on a RollingGroupby object?
Here is my dataframe:
# Let's say my objective is to count the number of unique cars
# over the last 1 day grouped by park
park | date | to_count
------------------------------
A | 2019-01-01 | Honda
A | 2019-01-03 | Lexus
A | 2019-01-05 | BMW
A | 2019-01-05 | Lexus
B | 2019-01-01 | BMW
B | 2019-01-08 | Lexus
B | 2019-01-08 | Lexus
B | 2019-01-10 | Ford
This is what I want:
park | date | unique_count
----------------------------------
A | 2019-01-01 | 1
A | 2019-01-03 | 1
A | 2019-01-05 | 2
B | 2019-01-01 | 1
B | 2019-01-08 | 1
B | 2019-01-10 | 1
# Bit of explanation:
# There are 2 types of cars coming to park A over the last 1 day on 5th Jan, so the distinct count is 2.
# There are 2 cars of 1 type (Lexus) coming to park B over the last 1 day on 8th Jan, so the distinct count is 1.
Here is what I tried:
import pandas as pd
import numpy as np
# initiate dataframe
df = pd.DataFrame({
    'park': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'date': ['2019-01-01', '2019-01-03', '2019-01-05', '2019-01-05',
             '2019-01-01', '2019-01-08', '2019-01-08', '2019-01-10'],
    'to_count': ['Honda', 'Lexus', 'BMW', 'Lexus', 'BMW', 'Lexus', 'Lexus', 'Ford']
})
# string to date
df['date'] = pd.to_datetime(df['date'])
# group. This is more intuitive to me but sadly this does not work.
unique_count = df.groupby('park').rolling('1d', on='date').to_count.nunique()
# factorize then group. This works (but why???)
df['factorized'] = pd.factorize(df.to_count)[0]
unique_count = df.groupby('park').rolling('1d', on='date').factorized.apply(lambda x: len(np.unique(x)) )
result = unique_count.reset_index().drop_duplicates(subset=['park', 'date'], keep='last')
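To emphasize: I need the rolling window functionality to work. In this example the window happens to be 1 day, but I may want it to work for 3 days, 7 days, 2 hours, or 5 seconds.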
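A likely explanation for why the factorized version works while the direct call does not: pandas rolling aggregations operate on numeric data, so a string column cannot be aggregated, whereas `pd.factorize` maps each distinct string to an integer code and counting distinct codes is equivalent to counting distinct strings. The following is a minimal sketch of that idea, not a definitive implementation. The helper name `rolling_unique_count` and the columns `code` and `unique_count` are made up for illustration, and the window is written as `'1D'`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'park': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2019-01-01', '2019-01-03', '2019-01-05', '2019-01-05',
                            '2019-01-01', '2019-01-08', '2019-01-08', '2019-01-10']),
    'to_count': ['Honda', 'Lexus', 'BMW', 'Lexus', 'BMW', 'Lexus', 'Lexus', 'Ford'],
})


def rolling_unique_count(frame, window):
    """Hypothetical helper: distinct `to_count` values per park over a time window."""
    # Time-based windows need the dates in order within each park.
    coded = frame.sort_values(['park', 'date']).copy()
    # Replace each distinct string with an integer code so rolling can handle it.
    coded['code'] = pd.factorize(coded['to_count'])[0]
    counts = (
        coded.set_index('date')
             .groupby('park')['code']
             .rolling(window)
             .apply(lambda codes: len(np.unique(codes)), raw=True)
    )
    # Several rows can share a timestamp; keep the last one, as in the question.
    return (
        counts.rename('unique_count')
              .reset_index()
              .drop_duplicates(subset=['park', 'date'], keep='last')
              .reset_index(drop=True)
    )


result = rolling_unique_count(df, '1D')
print(result)
```

Other windows can be passed the same way, for example `rolling_unique_count(df, '3D')`.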
Answer 0: (score: 1)
Try this:
- First, group the dataframe by park and date
- Then aggregate to_count by its number of unique values
import pandas as pd

df = pd.DataFrame({
    'park': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'date': ['2019-01-01', '2019-01-03', '2019-01-05', '2019-01-05',
             '2019-01-01', '2019-01-08', '2019-01-08', '2019-01-10'],
    'to_count': ['Honda', 'Lexus', 'BMW', 'Lexus', 'BMW', 'Lexus', 'Lexus', 'Ford']
})
# count the distinct cars for each (park, date) pair
agg_df = df.groupby(by=['park', 'date']).agg({'to_count': pd.Series.nunique}).reset_index()
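Grouping by park and date reproduces the desired output here only because every 1-day window in the sample data contains a single date; it is not a rolling count. For other window sizes, one possible sketch is to select each window explicitly with a boolean mask. The function name `window_nunique` and its parameters are hypothetical, and the window is assumed to be left-open and right-closed, i.e. (t - window, t].

```python
import pandas as pd

df = pd.DataFrame({
    'park': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2019-01-01', '2019-01-03', '2019-01-05', '2019-01-05',
                            '2019-01-01', '2019-01-08', '2019-01-08', '2019-01-10']),
    'to_count': ['Honda', 'Lexus', 'BMW', 'Lexus', 'BMW', 'Lexus', 'Lexus', 'Ford'],
})


def window_nunique(frame, window, group_col='park', time_col='date', value_col='to_count'):
    """Hypothetical helper: distinct values in (t - window, t] for each group and timestamp."""
    width = pd.Timedelta(window)
    rows = []
    for key, grp in frame.groupby(group_col):
        # One output row per distinct timestamp in the group.
        for t in grp[time_col].drop_duplicates():
            in_window = (grp[time_col] > t - width) & (grp[time_col] <= t)
            rows.append((key, t, grp.loc[in_window, value_col].nunique()))
    return pd.DataFrame(rows, columns=[group_col, time_col, 'unique_count'])


one_day = window_nunique(df, '1D')
three_days = window_nunique(df, '3D')
print(one_day)
print(three_days)
```

This works directly on the string column, at the cost of a Python-level loop over every timestamp, so it is likely to be slow on large frames.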
Answer 1: (score: 0)
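My solution is not very pythonic, but I think it gets the job done.
I take one park at a time, slice the dataframe by a day offset (adjust the number of days to get the rolling window you want), and retrieve the car values into a list.
With the list of cars for each day, we can count the number of unique cars for each day.
The result is a list, which you can convert to a dataframe if needed.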
import datetime

import numpy as np
import pandas as pd

# initiate dataframe
df = pd.DataFrame({
    'park': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'date': ['2019-01-01', '2019-01-03', '2019-01-05', '2019-01-05',
             '2019-01-01', '2019-01-08', '2019-01-08', '2019-01-10'],
    'to_count': ['Honda', 'Lexus', 'BMW', 'Lexus', 'BMW', 'Lexus', 'Lexus', 'Ford']
})
# string to date
df['date'] = pd.to_datetime(df['date'])

result = []
for park in ['A', 'B']:
    # Do one park at a time
    df_park = df.loc[df['park'] == park, ['date', 'to_count']].set_index('date')
    # iterate through the dataframe and put results to list.
    for i, _ in df_park.iterrows():
        # THIS IS YOUR ROLLING VALUE IN DAYS
        days = 1
        # create the starting date
        b = i - datetime.timedelta(days=days)
        # create a list of cars during the period
        # (note: .loc slicing includes both endpoints, so the window is [b, i])
        li = df_park.loc[b:i].values
        # reduce the list to unique cars
        unique_cars = len(np.unique(li))
        # append the results to the result list
        result.append((park, i.strftime('%B %d, %Y'), unique_cars))

# the final list has duplicates, so use set to drop the dups and re-sort for the result.
sorted(set(result))
The result looks like this:
[('A', 'January 01, 2019', 1),
('A', 'January 03, 2019', 1),
('A', 'January 05, 2019', 2),
('B', 'January 01, 2019', 1),
('B', 'January 08, 2019', 1),
('B', 'January 10, 2019', 1)]
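To turn the list above into a dataframe, as mentioned earlier, something like the following sketch could be used. The column names and the variable `result_df` are chosen here for illustration, and parsing `%B` assumes English month names. Converting the strings back to real dates also makes the sort chronological, since sorting the formatted strings only orders them alphabetically.

```python
import pandas as pd

# The list produced by the loop above, copied here so the block runs on its own.
result = [('A', 'January 01, 2019', 1),
          ('A', 'January 03, 2019', 1),
          ('A', 'January 05, 2019', 2),
          ('B', 'January 01, 2019', 1),
          ('B', 'January 08, 2019', 1),
          ('B', 'January 10, 2019', 1)]

result_df = pd.DataFrame(result, columns=['park', 'date', 'unique_count'])
# Parse the formatted strings back into timestamps before sorting.
result_df['date'] = pd.to_datetime(result_df['date'], format='%B %d, %Y')
result_df = result_df.sort_values(['park', 'date']).reset_index(drop=True)
print(result_df)
```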