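I have a DataFrame in which one column contains only tuples of strings, together with the day of the week and the day of the month for each row.
So: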
                                          all_app_id  day_of_week  dayofmonth
0  (g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx...          Fri          29
1  (awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8...          Sat          30
2  (kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz...          Sun          31
3  (ze4rr0vi, t6rf8pr, ze4rr0vi, 3c83655d34c365c9...          Mon           1
4  (dflewd0u, twidi4l3, twidi4l3, ecur4180, e0ff8...         Tues           2
5  (ohafu6x5, kz8udlea, ohafu6x5, jfigr5ko, ohafu...         Weds           3
and a list of the unique app_ids:
array(['g8m4lecv', 'uyhsx6lo', 'kz8udlea', ..., 'nrow38d5', 'dekfx963',
'ybkne8eo'], dtype=object)
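What I want to compute, for each app_id in the list of unique app_ids, is the number of times it appears, grouped by day_of_week and by dayofmonth, so that for each app_id I can work out its most common day_of_week and dayofmonth. Overall, I could then also compute the most common day of the week and day of the month by operating on the resulting DataFrame.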
Example output (it should be easy from there) would be:
Day of week
           Mon  Tues  Weds  Thurs  Fri
g8m4lecv    34    53    84     97   20
uyhsx6lo  3423  5443   235     33   54
kz8udlea   121  1212    39     93  756
...
Day of month
            1    2   3    4   5 ...
g8m4lecv    3    5   8    7  20
uyhsx6lo   12   12  23  233  54
kz8udlea   31  212  39   93   2
...
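I have been trying to use groupby, as in similar answers, together with for app_id in unique_app_ids and str.contains(app_id).sum(), but I can't work out how to build and organize this loop. I have also looked into list(zip()) to combine the tuples, but I am not sure how to combine them by day of the week without grouping.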
Answer 0 (score: 1)
import pandas as pd

df = pd.DataFrame(
    {'all_app_id': ['g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx',
                    'awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8',
                    'kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz',
                    'ze4rr0vi, t6rf8pr, ze4rr0vi, 3c83655d34c365c9',
                    'dflewd0u, twidi4l3, twidi4l3, ecur4180, e0ff8',
                    'ohafu6x5, kz8udlea, ohafu6x5, jfigr5ko, ohafu'],
     'day_of_week': ['Fri', 'Sat', 'Sun', 'Mon', 'Tues', 'Weds'],
     'dayofmonth': [29, 30, 31, 1, 2, 3]})
>>> (pd.melt(
        pd.concat([df.all_app_id.str.split(',', expand=True),
                   df[['day_of_week', 'dayofmonth']]],
                  axis=1),
        id_vars=['day_of_week', 'dayofmonth'],
        value_name='app_id')
     .drop('variable', axis=1)
     .pivot_table(index='app_id', columns='day_of_week', aggfunc='count')
     .fillna(0)
    )
                 dayofmonth
day_of_week       Fri  Mon  Sat  Sun  Tues  Weds
app_id
3c83655d34c365c9    0    1    0    0     0     0
6f91a1e2            0    0    0    1     0     0
e0ff8               0    0    0    0     1     0
ecur4180            0    0    0    0     1     0
jfigr5ko            0    0    0    0     0     1
kz8udlea            1    0    0    0     0     1
lfqlz               0    0    0    1     0     0
mv4uo5xy            0    0    0    2     0     0
ohafu               0    0    0    0     0     1
ohafu6x5            0    0    0    0     0     1
t6rf8               0    0    1    0     0     0
t6rf8pr             0    1    0    0     0     0
twidi4l3            0    0    0    0     2     0
uyhsx               1    0    0    0     0     0
uyhsx6lo            2    0    0    0     0     0
yg02r5dz            0    0    3    0     0     0
ze4rr0vi            0    1    0    0     0     0
awkcmpns            0    0    1    0     0     0
dflewd0u            0    0    0    0     1     0
g8m4lecv            1    0    0    0     0     0
kz8udlea            0    0    0    1     0     0
ohafu6x5            0    0    0    0     0     1
ze4rr0vi            0    1    0    0     0     0
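A side note on the table above: some IDs (kz8udlea, ohafu6x5, ze4rr0vi) appear on two separate rows. That is most likely because splitting on ',' alone leaves a leading space on every ID except the first one in each string, so ' kz8udlea' and 'kz8udlea' are counted as different values. Below is a minimal sketch of one way to avoid this, assuming pandas 0.25 or later (for Series.explode) and using only the first three sample rows. The names `long` and `counts` are illustrative and do not come from the original answer, and pd.crosstab is swapped in for the melt/pivot_table chain.

```python
import pandas as pd

df = pd.DataFrame(
    {'all_app_id': ['g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx',
                    'awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8',
                    'kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz'],
     'day_of_week': ['Fri', 'Sat', 'Sun'],
     'dayofmonth': [29, 30, 31]})

# One row per (original row, app_id); explode repeats the other columns.
long = (df.assign(app_id=df['all_app_id'].str.split(','))
          .explode('app_id')
          .reset_index(drop=True))

# Remove the whitespace left behind by splitting on ',' alone.
long['app_id'] = long['app_id'].str.strip()

# Count how often each app_id occurs on each day of the week.
counts = pd.crosstab(long['app_id'], long['day_of_week'])
print(counts)
```

With the whitespace stripped, each app_id should occupy a single row, and the same `long` frame can be cross-tabulated against dayofmonth in the same way.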
First, you need to split the all_app_id column into separate columns:
>>> df.all_app_id.str.split(',', expand=True)
          0         1         2                 3      4
0  g8m4lecv  uyhsx6lo  kz8udlea          uyhsx6lo  uyhsx
1  awkcmpns  yg02r5dz  yg02r5dz          yg02r5dz  t6rf8
2  kz8udlea  6f91a1e2  mv4uo5xy          mv4uo5xy  lfqlz
3  ze4rr0vi   t6rf8pr  ze4rr0vi  3c83655d34c365c9   None
4  dflewd0u  twidi4l3  twidi4l3          ecur4180  e0ff8
5  ohafu6x5  kz8udlea  ohafu6x5          jfigr5ko  ohafu
Then concatenate the day_of_week and dayofmonth columns back on:
df2 = pd.concat([df.all_app_id.str.split(',', expand=True),
                 df[['day_of_week', 'dayofmonth']]], axis=1)
>>> df2
          0         1         2                 3      4  day_of_week  dayofmonth
0  g8m4lecv  uyhsx6lo  kz8udlea          uyhsx6lo  uyhsx          Fri          29
1  awkcmpns  yg02r5dz  yg02r5dz          yg02r5dz  t6rf8          Sat          30
2  kz8udlea  6f91a1e2  mv4uo5xy          mv4uo5xy  lfqlz          Sun          31
3  ze4rr0vi   t6rf8pr  ze4rr0vi  3c83655d34c365c9   None          Mon           1
4  dflewd0u  twidi4l3  twidi4l3          ecur4180  e0ff8         Tues           2
5  ohafu6x5  kz8udlea  ohafu6x5          jfigr5ko  ohafu         Weds           3
Then melt the result into long form:
>>> df3 = pd.melt(df2, id_vars=['day_of_week', 'dayofmonth'], value_name='app_id')
>>> df3
    day_of_week  dayofmonth  variable    app_id
0           Fri          29         0  g8m4lecv
1           Sat          30         0  awkcmpns
2           Sun          31         0  kz8udlea
...
28         Tues           2         4     e0ff8
29         Weds           3         4     ohafu
Drop the variable column:
df3 = df3.drop('variable', axis=1)
Then pivot the result:
df4 = df3.pivot_table(index='app_id', columns='day_of_week', aggfunc='count')
>>> df4
                 dayofmonth
day_of_week       Fri  Mon  Sat  Sun  Tues  Weds
app_id
3c83655d34c365c9  NaN    1  NaN  NaN   NaN   NaN
6f91a1e2          NaN  NaN  NaN    1   NaN   NaN
e0ff8             NaN  NaN  NaN  NaN     1   NaN
ecur4180          NaN  NaN  NaN  NaN     1   NaN
...
Then fill the NaN values with zeros:
df4 = df4.fillna(0)
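As an aside, the pivot and the zero-filling can probably be combined. The sketch below assumes a small long-form frame shaped like df3 after the variable column has been dropped (the four rows are taken from the sample data), and uses the `values` and `fill_value` parameters of pivot_table. Passing values='dayofmonth' as a single column name should also give single-level columns, so no extra column level is left to remove.

```python
import pandas as pd

# A small long-form frame shaped like df3 after dropping 'variable'.
df3 = pd.DataFrame(
    {'day_of_week': ['Fri', 'Fri', 'Sat', 'Sun'],
     'dayofmonth': [29, 29, 30, 31],
     'app_id': ['g8m4lecv', 'uyhsx6lo', 'awkcmpns', 'kz8udlea']})

# Count rows per (app_id, day_of_week); missing combinations become 0.
df4 = df3.pivot_table(index='app_id', columns='day_of_week',
                      values='dayofmonth', aggfunc='count', fill_value=0)
print(df4)
```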
At this point, you can also drop dayofmonth from the top level of the columns:
df4.columns = df4.columns.droplevel(0)
Voilà!
>>> df4.head()
day_of_week       Fri  Mon  Sat  Sun  Tues  Weds
app_id
3c83655d34c365c9    0    1    0    0     0     0
6f91a1e2            0    0    0    1     0     0
e0ff8               0    0    0    0     1     0
ecur4180            0    0    0    0     1     0
jfigr5ko            0    0    0    0     0     1
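The question also asked for counts by dayofmonth and for the most common day per app_id. A rough sketch of how that might follow from a long-form frame is shown below. The rows are made-up illustration data (only the IDs and day labels are borrowed from the sample), and the names `long`, `by_weekday`, `by_monthday`, `top_weekday`, `top_monthday` and `overall_weekday` are hypothetical. It uses pd.crosstab and DataFrame.idxmax rather than the pivot_table chain above.

```python
import pandas as pd

# Made-up long-form data: one row per appearance of an app_id.
long = pd.DataFrame(
    {'app_id': ['kz8udlea', 'kz8udlea', 'kz8udlea',
                'uyhsx6lo', 'uyhsx6lo', 'uyhsx6lo'],
     'day_of_week': ['Fri', 'Sun', 'Sun', 'Fri', 'Fri', 'Weds'],
     'dayofmonth': [29, 31, 31, 29, 29, 3]})

# One count table per grouping, with app_id as the row index.
by_weekday = pd.crosstab(long['app_id'], long['day_of_week'])
by_monthday = pd.crosstab(long['app_id'], long['dayofmonth'])

# Most common weekday / day of month for each app_id
# (on a tie, idxmax returns the first column that reaches the maximum).
top_weekday = by_weekday.idxmax(axis=1)
top_monthday = by_monthday.idxmax(axis=1)

# Most common weekday over all app_ids.
overall_weekday = by_weekday.sum().idxmax()

print(top_weekday)
print(top_monthday)
print(overall_weekday)
```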