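Note on the suggested duplicate: categorical sorting does not work here, because only a subset of the strings in the column is used for sorting. If the column is set to a categorical, every string not listed as a category becomes null.
Original question: I have a working example, but I feel there must be a better / more efficient way to compute these results.
I have a large dataframe of machine data in which the order of events within each timestamp was not maintained correctly. It looks like the input event column below. You can see that the selected events have been reordered within each timestamp according to the event_order list (events_to_sort in the code below).
The input is event. The desired output is the sorted_output event in the last column. The horizontal lines are added only to show that sorting happens within each timestamp block.
Timestamps have been simplified to integers. Event names have been simplified as well; in the real data they are full string names, not short letter codes.
Is there a more efficient way?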
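To illustrate the point about unlisted strings becoming null, here is a minimal sketch. The three-row Series and the two-item category list are made up for the illustration and are not the real data.

```python
import pandas as pd

# Hypothetical mini example: only 'dog' and 'up' are listed as categories.
events = pd.Series(["dog", "abc", "up"])
as_cat = pd.Series(pd.Categorical(events, categories=["dog", "up"], ordered=True))

# 'abc' is not a listed category, so it is replaced by NaN.
print(as_cat.isna().tolist())  # [False, True, False]
```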
    timestamp  event (input)  event (sorted_output)
0   0          wer            wer
_____________________________________________
1   1          up             dog
2   1          def            def
3   1          abc            abc
4   1          dog            fast
5   1          prq            prq
6   1          cde            cde
7   1          fast           up
8   1          bnm            bnm
_____________________________________________
9   2          ert            ert
10  2          and            and
11  2          ert            ert
12  2          ghj            ghj
13  2          streets        down
14  2          down           streets
_____________________________________________
15  3          runs           dog
16  3          dog            runs
17  3          ert            ert
18  3          up             up
19  3          dfg            dfg
20  3          prq            prq
Working code
import pandas as pd
df = pd.DataFrame(
[
{'timestamp': 0, 'event': 'wer'},
{'timestamp': 1, 'event': 'up'},
{'timestamp': 1, 'event': 'def'},
{'timestamp': 1, 'event': 'abc'},
{'timestamp': 1, 'event': 'dog'},
{'timestamp': 1, 'event': 'prq'},
{'timestamp': 1, 'event': 'cde'},
{'timestamp': 1, 'event': 'fast'},
{'timestamp': 1, 'event': 'bnm'},
{'timestamp': 2, 'event': 'ert'},
{'timestamp': 2, 'event': 'and'},
{'timestamp': 2, 'event': 'ert'},
{'timestamp': 2, 'event': 'ghj'},
{'timestamp': 2, 'event': 'streets'},
{'timestamp': 2, 'event': 'down'},
{'timestamp': 3, 'event': 'runs'},
{'timestamp': 3, 'event': 'dog'},
{'timestamp': 3, 'event': 'ert'},
{'timestamp': 3, 'event': 'up'},
{'timestamp': 3, 'event': 'dfg'},
{'timestamp': 3, 'event': 'prq'},
]
)
df = df[['timestamp', 'event']]
# events to sort in order (they aren't actually alphabetical this is mock data)
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']
# this method gleaned from here https://stackoverflow.com/questions/23482668/sorting-by-a-custom-list-in-pandas
sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))
# create a temporary rank column for sorting
df['sort_col'] = df['event'].map(sorter_index)
ev_ind = df.event.isin(events_to_sort)
# loop through each timestamp block
for time in df.timestamp.unique():
    # limit to only sortable events within the timestamp
    section_index = df.timestamp.eq(time) & ev_ind
    df_temp = df.loc[section_index]
    if len(df_temp) > 1:
        # if there is more than one sortable event, sort and write the values back to the original df
        df.loc[section_index, 'event'] = df_temp.sort_values(by='sort_col')['event'].values
# drop temp sorting col
df = df.drop('sort_col', axis=1)
Answer 0 (score: 2)
In your case:
s=df.loc[df.event.isin(events_to_sort)].copy()
s.event=pd.Categorical(s.event,categories=events_to_sort,ordered=True)
s=s.sort_values(['timestamp','event'])
s.index=sorted(s.index)
df=s.combine_first(df)
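The step `s.index=sorted(s.index)` is what lets the sorted events drop back into the slots the sortable rows originally occupied. A minimal sketch of just that step, on a made-up three-row frame (the index values 1, 4, 7 and the shortened category list are assumptions for the illustration):

```python
import pandas as pd

# Hypothetical subset: three sortable events that sat at rows 1, 4 and 7.
s = pd.DataFrame({"event": ["up", "dog", "fast"]}, index=[1, 4, 7])
s["event"] = pd.Categorical(s["event"], categories=["dog", "fast", "up"], ordered=True)

# Sorting moves the rows, and each row carries its old index label along.
s = s.sort_values("event")
before = list(s.index)
print(before)  # [4, 7, 1]

# Re-labelling with the sorted labels keeps the new row order but assigns
# the original slots in ascending order: dog -> 1, fast -> 4, up -> 7.
s.index = sorted(s.index)
print(list(s.index))  # [1, 4, 7]
```

This only lines up correctly if the timestamps of the sortable rows are already in ascending order in the original frame, as they are in the example data.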
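Answer 1 (score: 0)
WenyoBen's answer got me thinking and supplied the missing piece of the puzzle. Here are two working solutions: one uses a mapped sort, the other a categorical sort.
Solution 1: mapped sort (uses an extra sort column)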
sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))
# get subset to sort
s = df.loc[df.event.isin(events_to_sort)].copy()
# make sort column
s['sort_col'] = s['event'].map(sorter_index)
# sort by timestamp first, then by the sort column
s = s.sort_values(['timestamp', 'sort_col'])
# reorder the index such that they will insert back into original df properly
s.index = sorted(s.index)
# remove the temporary sort_col
s.drop('sort_col', axis=1, inplace=True)
# place sorted events back into original df in the correct location
df = s.combine_first(df)
Solution 2: categorical sort
# get subset to sort
s = df.loc[df.event.isin(events_to_sort)].copy()
# convert event column to categorical type
s.event = s.event.astype('category')
# set category sort order
s['event'] = s['event'].cat.set_categories(events_to_sort)
# sort by timestamp first, then by event (in category order)
s = s.sort_values(['timestamp', 'event'])
# reorder the index such that they will insert back into original df properly
s.index = sorted(s.index)
# place sorted events back into original df in the correct location
df = s.combine_first(df)
Output of both:
timestamp event
0 0.0 wer
1 1.0 dog
2 1.0 def
3 1.0 abc
4 1.0 fast
5 1.0 prq
6 1.0 cde
7 1.0 up
8 1.0 bnm
9 2.0 ert
10 2.0 and
11 2.0 ert
12 2.0 ghj
13 2.0 down
14 2.0 streets
15 3.0 dog
16 3.0 runs
17 3.0 ert
18 3.0 up
19 3.0 dfg
20 3.0 prq
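Note that timestamp comes back as float in the output above. As a possible variation that is not taken from either answer, the sorted values can be written back through a boolean mask instead of `combine_first`, which leaves the other columns and their dtypes untouched. This is a sketch on a shortened copy of the data; the names `rank`, `mask` and `sub` are illustrative, and it assumes the rows are already in ascending timestamp order.

```python
import pandas as pd

# Shortened, made-up copy of the example data (timestamps already ascending).
df = pd.DataFrame({
    "timestamp": [1, 1, 1, 1, 2, 2, 2],
    "event": ["up", "def", "dog", "fast", "ert", "streets", "down"],
})
events_to_sort = ["dog", "runs", "fast", "up", "and", "down", "streets"]

# Map each sortable event to its position in the desired order.
rank = {name: i for i, name in enumerate(events_to_sort)}

# Rows holding a sortable event; all other rows are never touched.
mask = df["event"].isin(events_to_sort)

# Sort only the sortable rows by timestamp, then by rank.
sub = df.loc[mask].assign(sort_col=lambda d: d["event"].map(rank))
sub = sub.sort_values(["timestamp", "sort_col"], kind="stable")

# Write the sorted events back positionally into the masked slots.
df.loc[mask, "event"] = sub["event"].to_numpy()

print(df["event"].tolist())
# ['dog', 'def', 'fast', 'up', 'ert', 'down', 'streets']
```

Because only the event column is assigned, timestamp keeps its integer dtype in this sketch.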