Setting up the DataFrame:
import pandas as pd
people = ['shayna','shayna','shayna','shayna','john']
dates = ['01-01-18','01-01-18','01-01-18','01-02-18','01-02-18']
places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
d = {'Person':people,'Service_Date':dates, 'Site_Where_Served':places}
df = pd.DataFrame(d)
df
Person Service_Date Site_Where_Served
shayna 01-01-18 hospital
shayna 01-01-18 hospital
shayna 01-01-18 inpatient
shayna 01-02-18 hospital
john 01-02-18 hospital
What I want to do is count the unique (Person, Service_Date) pairs, grouped by Site_Where_Served.
Expected output:
Site_Where_Served Site_Visit_Count
hospital 3
inpatient 1
My attempt:
df[['Person', 'Service_Date']].groupby(df['Site_Where_Served']).nunique().reset_index(name='Site_Visit_Count')
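But it doesn't know how to reset the index that way. So I tried leaving reset_index out, and realized it isn't counting unique pairs of 'Person' and 'Service_Date' at all, because the output looks like this: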
                   Person  Service_Date
Site_Where_Served
hospital                2             2
inpatient               1             1
Answer 0 (score: 4)
drop_duplicates + groupby + count
(df.drop_duplicates()
.groupby('Site_Where_Served')
.Site_Where_Served.count()
.reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Note that one small difference between count and size is that the former does not count NaN entries.
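To make the count / size difference concrete, here is a small sketch. The `demo` frame and its missing `Person` value are made up for illustration and are not part of the question's data. It counts a non-key column, since that is where the two methods can disagree.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Person value (not the question's data).
demo = pd.DataFrame({
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient'],
    'Person': ['shayna', np.nan, 'shayna'],
})

grouped = demo.groupby('Site_Where_Served')

# count() skips NaN in the column being counted.
person_count = grouped['Person'].count()

# size() counts every row in the group, NaN or not.
group_size = grouped.size()

print(person_count)
print(group_size)
```

For the question's data there are no NaN values, so either method should give the same counts.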
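groupby + nunique
This really just fixes your current solution, but I don't recommend it because it takes more steps. First, turn each row of the two columns into a tuple, then group by Site_Where_Served, then count the unique tuples: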
(df[['Person', 'Service_Date']]
   .apply(tuple, axis=1)
.groupby(df.Site_Where_Served)
.nunique()
.reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
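As a side-by-side sketch of why the original attempt reported 2 for hospital while the tuple version reports 3: per-column nunique counts distinct people and distinct dates separately, whereas the tuple version counts distinct (Person, Service_Date) combinations. The variable names `per_column`, `pairs`, and `per_pair` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['shayna', 'shayna', 'shayna', 'shayna', 'john'],
    'Service_Date': ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18'],
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital'],
})

# Per-column nunique: each column is counted independently within a site,
# so hospital gets 2 people and 2 dates.
per_column = df.groupby('Site_Where_Served')[['Person', 'Service_Date']].nunique()

# Pair-wise nunique: each row becomes one (Person, Service_Date) tuple first,
# so hospital gets 3 distinct pairs.
pairs = df[['Person', 'Service_Date']].apply(tuple, axis=1)
per_pair = pairs.groupby(df['Site_Where_Served']).nunique()

print(per_column)
print(per_pair)
```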
Answer 1 (score: 3)
In my opinion, a better approach is to drop duplicates before using groupby.size:
res = df.drop_duplicates()\
.groupby('Site_Where_Served').size()\
.reset_index(name='Site_Visit_Count')
print(res)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
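One caveat worth sketching: calling drop_duplicates() with no arguments compares entire rows, which works here because the frame holds only the three relevant columns. If the real data had more columns, passing `subset` would probably be needed. The `Provider` column below is hypothetical and added only to show the effect.

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['shayna', 'shayna', 'shayna', 'shayna', 'john'],
    'Service_Date': ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18'],
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital'],
})

# Hypothetical extra column that makes every row distinct.
df_extra = df.assign(Provider=['a', 'b', 'c', 'd', 'e'])

key_cols = ['Person', 'Service_Date', 'Site_Where_Served']

# Restrict the duplicate check to the columns that define a visit.
res = (df_extra.drop_duplicates(subset=key_cols)
               .groupby('Site_Where_Served').size()
               .reset_index(name='Site_Visit_Count'))

# Without subset, no row is dropped and hospital would be counted 4 times.
naive = df_extra.drop_duplicates().groupby('Site_Where_Served').size()

print(res)
```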
Answer 2 (score: 2)
Maybe value_counts
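One way a value_counts-based version might look, offered as a sketch under the assumption that duplicates are dropped first as in the other answers. This is not necessarily the answerer's exact code.

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['shayna', 'shayna', 'shayna', 'shayna', 'john'],
    'Service_Date': ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18'],
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital'],
})

# Drop repeated rows, then count how often each site remains.
# rename_axis names the index explicitly so reset_index produces
# the expected column name regardless of pandas version.
res = (df.drop_duplicates()['Site_Where_Served']
         .value_counts()
         .rename_axis('Site_Where_Served')
         .reset_index(name='Site_Visit_Count'))

print(res)
```

value_counts sorts by frequency in descending order by default, so the most visited site comes first.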
Answer 3 (score: 1)
Counter 1
from collections import Counter

pd.Series(Counter(df.drop_duplicates().Site_Where_Served)) \
    .rename_axis('Site_Where_Served').reset_index(name='Site_Visit_Count')
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Counter 2
pd.DataFrame(
    list(Counter(t[2] for t in set(map(tuple, df.values))).items()),
    columns=['Site_Where_Served', 'Site_Visit_Count']
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1