How do I filter a pandas DataFrame, then group by and aggregate the values into lists?

Time: 2020-09-18 17:01:24

Tags: python pandas group-by apply

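I am trying to use groupby and get the values as lists.

The final df should have "bid" as the index and the scores as a list in the second column (e.g. [85, 58] if both rows have the same "bid").

Here is my df:

When I use merged.groupby("bid")['score_y'].apply(list)

I get TypeError: 'Series' objects are mutable, thus they cannot be hashed.

Does anyone know why I am getting this error?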

Edit 1:

Here is the data source: https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i

The df "ins" looks like the following, where "bid" is the number before the "_" in "iid".
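To make the relationship between "iid" and "bid" concrete, here is a minimal sketch of one way to derive "bid" from "iid". The sample "iid" values are made up for illustration and are not taken from the dataset; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample: each iid is assumed to look like "<bid>_<something>"
ins = pd.DataFrame({"iid": ["19_20180104", "19_20180816", "24_20180322"]})

# Take the text before the first "_" and convert it to an integer
ins["bid"] = ins["iid"].str.split("_").str[0].astype(int)

print(ins)
```

Rows that share the same prefix end up with the same "bid", which is what the later groupby relies on.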

My code so far:

ins2018 = ins[ins['year'] == 2018] #.drop(["iid", 'date', 'type', 'timestamp', 'year', 'Missing Score'], axis = 1)
    
# new = ins2018.loc[ins2018["score"] > 0].sort_values("date").groupby("bid").count()
# new = new.loc[new["iid"] == 2]
# merge = pd.merge(new, ins2018, how = "left", on = "bid").sort_values('date_y')
# merged = merge.loc[merge['score_y'] > 0].drop(['iid_x', 'date_x', 'score_x', 'type_x', 'timestamp_x', 'year_x', 'Missing Score_x', 'iid_y', 'type_y', 'timestamp_y', 'year_y', 'Missing Score_y', "date_y"], axis = 1)

1 answer:

Answer 0: (score: 1)

  • Use pandas.DataFrame.aggregate to aggregate score_y into a list.
  • Depending on merged, the index may need to be reset.

# reset the index of merged
merged = merged.reset_index(drop=True)

# group by bid and aggregate score_y into a list
merged.groupby('bid').agg({'score_y': list})

Example

import pandas as pd
import numpy as np
import random

np.random.seed(365)
random.seed(365)
rows = 100
data = {'a': np.random.randint(10, size=(rows)),
        'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)]}
df = pd.DataFrame(data)

# groupby and aggregate a list
dfg = df.groupby('groups').agg({'a': list})

dfg
[out]: 
                                                                                 a
groups                                                                            
1-5                                     [7, 8, 4, 3, 1, 7, 9, 3, 2, 7, 6, 4, 4, 6]
100-500                        [4, 3, 2, 8, 6, 3, 1, 5, 7, 7, 3, 5, 4, 7, 2, 2, 4]
26-100                   [4, 2, 2, 9, 5, 3, 1, 0, 7, 9, 7, 7, 9, 9, 9, 7, 0, 0, 4]
500-1000                                      [2, 8, 0, 7, 6, 6, 8, 4, 6, 2, 2, 5]
6-25      [5, 9, 7, 0, 6, 5, 7, 9, 9, 9, 6, 5, 6, 0, 2, 7, 4, 0, 3, 9, 0, 5, 0, 3]
>1000                                   [2, 1, 3, 6, 7, 6, 0, 5, 9, 9, 3, 2, 6, 0]
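
For comparison with the call in the question, here is a small sketch on made-up toy data (the column names mirror the example above, the values are hypothetical). It shows that, on a frame with a plain index and unique column names, selecting the column and calling .apply(list) builds the same lists as .agg({'a': list}); the only difference is that the first returns a Series and the second a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the example
df = pd.DataFrame({'groups': ['x', 'y', 'x', 'y', 'x'],
                   'a': [1, 2, 3, 4, 5]})

# DataFrame result: one column 'a' holding a list per group
as_frame = df.groupby('groups').agg({'a': list})

# Series result: the form used in the question
as_series = df.groupby('groups')['a'].apply(list)

print(as_frame['a'].tolist())
print(as_series.tolist())
```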
Updated: groupby on ins

import pandas as pd

# load data
ins = pd.read_csv('data/Restaurant_Scores_-_LIVES_Standard.csv')

# convert inspection_date to a datetime format
ins.inspection_date = pd.to_datetime(ins.inspection_date)

# add a year column
ins['year'] = ins.inspection_date.dt.year

# select data for 2018
ins2018 = ins[ins['year'] == 2018]

################################################################
# this is where you run into issues
# new is the counts for every column
# this is what you could have done to get the number of inspection counts
# just count the occurrences of business_id
counts = ins2018.groupby('business_id').agg({'business_id': 'count'}).rename(columns={'business_id': 'inspection_counts'}).reset_index()

# don't do this: get dataframe of counts
# new = ins2018.loc[ins2018["inspection_score"] > 0].sort_values("inspection_date").groupby("business_id").count()

# don't do this: select data
# new = new.loc[new["inspection_id"] == 2].reset_index()

# merge updated
merge = pd.merge(counts, ins2018, how = "left", on = "business_id")
################################################################

# select data again
merged = merge.loc[(merge['inspection_score_y'] > 0) & (merge.inspection_counts >= 2)]

# groupby and aggregate list
mg = merged.groupby('business_id').agg({'inspection_score_y': list})

# display(mg)
            inspection_score_y
business_id                   
31                [96.0, 96.0]
54                [94.0, 94.0]
61                [94.0, 94.0]
66                [98.0, 98.0]
101               [92.0, 92.0]
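
The filter, count and aggregate sequence can also be sketched without a merge. This is a minimal illustration, not the answer's method: it assumes a small made-up frame using the question's column names (bid, score, year), and it swaps in groupby(...).transform('count') in place of a separate counts table.

```python
import pandas as pd

# Hypothetical data; column names follow the question, values are invented
ins = pd.DataFrame({
    "bid":   [1, 1, 2, 3, 3, 3, 4, 4],
    "score": [85, 58, 90, 70, 75, 80, -1, 92],
    "year":  [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2017],
})

# Step 1: filter to 2018 inspections that have a positive score
valid = ins[(ins["year"] == 2018) & (ins["score"] > 0)]

# Step 2: per-row count of valid inspections for that row's bid
n_per_bid = valid.groupby("bid")["score"].transform("count")

# Step 3: keep bids with exactly two inspections and collect the scores
result = valid[n_per_bid == 2].groupby("bid")["score"].agg(list)

print(result)
```

Because transform returns a value for every row, the count lines up with valid directly and no suffixed columns such as score_x or score_y are created.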