Question

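A note on the suggested duplicate of this question: categorical sorting does not work here, because only a subset of the strings in the column is used for sorting. If the column is set as a categorical index, every unlisted "category"/string becomes null.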
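A minimal sketch of that problem, using a shortened, made-up event list (the names `events` and `as_cat` are illustrative only): converting the whole column to a categorical that lists only the sortable names leaves the other values as NaN.

```python
import pandas as pd

# illustrative data: only 'up' and 'dog' appear in the custom order
events = pd.Series(['up', 'def', 'abc', 'dog'])
events_to_sort = ['dog', 'runs', 'fast', 'up']

# values that are not among the categories are converted to NaN
as_cat = pd.Series(pd.Categorical(events, categories=events_to_sort, ordered=True))

unlisted_lost = as_cat.isna().tolist()
print(as_cat)
print(unlisted_lost)
```

This is why the answers below sort only the subset of rows whose event is in the list and then put them back.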

Original question: I have a working example, but I feel there must be a better/more efficient way to compute these results.

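I have a large DataFrame of machine data in which the order of events within each timestamp was not maintained correctly. It looks like the input event column below. You can see that the selected events have been reordered within each timestamp according to the event order list (`events_to_sort` in the code).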

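The input is the event column. The desired output is the sorted_output event in the last column. The horizontal lines were added to show that sorting happens only within each timestamp block.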

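Timestamps have been simplified to integers. Event names have been simplified as well; in the real data they are full string names, not short letter codes.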

Is there a more efficient way to do this?

                  input      sorted_output
    timestamp     event      event
0           0      wer       wer   
_________________________________
1           1       up       dog
2           1      def       def
3           1      abc       abc
4           1      dog      fast
5           1      prq       prq
6           1      cde       cde
7           1     fast        up
8           1      bnm       bnm
_________________________________
9           2      ert       ert
10          2      and       and
11          2      ert       ert
12          2      ghj       ghj
13          2  streets      down
14          2     down   streets
_________________________________
15          3     runs       dog
16          3      dog      runs
17          3      ert       ert
18          3       up        up
19          3      dfg       dfg
20          3      prq       prq

Working code

import pandas as pd

df = pd.DataFrame(
    [
        {'timestamp': 0, 'event': 'wer'},
        {'timestamp': 1, 'event': 'up'},
        {'timestamp': 1, 'event': 'def'},
        {'timestamp': 1, 'event': 'abc'},
        {'timestamp': 1, 'event': 'dog'},
        {'timestamp': 1, 'event': 'prq'},
        {'timestamp': 1, 'event': 'cde'},
        {'timestamp': 1, 'event': 'fast'},
        {'timestamp': 1, 'event': 'bnm'},
        {'timestamp': 2, 'event': 'ert'},
        {'timestamp': 2, 'event': 'and'},
        {'timestamp': 2, 'event': 'ert'},
        {'timestamp': 2, 'event': 'ghj'},
        {'timestamp': 2, 'event': 'streets'},
        {'timestamp': 2, 'event': 'down'},
        {'timestamp': 3, 'event': 'runs'},
        {'timestamp': 3, 'event': 'dog'},
        {'timestamp': 3, 'event': 'ert'},
        {'timestamp': 3, 'event': 'up'},
        {'timestamp': 3, 'event': 'dfg'},
        {'timestamp': 3, 'event': 'prq'},
    ]
)
df = df[['timestamp', 'event']]

# events to sort in order (they aren't actually alphabetical this is mock data)
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']

# this method gleaned from here https://stackoverflow.com/questions/23482668/sorting-by-a-custom-list-in-pandas
sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))

# create a temporary rank column for sorting
df['sort_col'] = df['event'].map(sorter_index)

ev_ind = df.event.isin(events_to_sort)

# loop through each timestamp block
for time in df.timestamp.unique():
    # limit to only sortable events within the timestamp
    section_index = df.timestamp.eq(time) & ev_ind
    df_temp = df.loc[section_index]

    if len(df_temp) > 1:
        # if there is more than 1 sortable event tag sort and set the values back to the original df
        df.loc[section_index, 'event'] = df_temp.sort_values(by='sort_col')['event'].values

# drop temp sorting col
df = df.drop('sort_col', axis=1)

Answer 1

In your case:

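# take only the rows whose event is in the custom order
s = df.loc[df.event.isin(events_to_sort)].copy()
# make event an ordered categorical so that it sorts in the custom order
s.event = pd.Categorical(s.event, categories=events_to_sort, ordered=True)
s = s.sort_values(['timestamp', 'event'])
# reuse the original row labels in ascending order so rows land in the same slots
s.index = sorted(s.index)
# fill in all remaining rows from the original df
df = s.combine_first(df)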
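The step that makes this work is `s.index = sorted(s.index)`. Here is a small sketch of what it does, using a made-up three-row subset (the frame, the `rank` dict, and `labels_after_sort` are illustrative, not part of the answer): after sorting, the row labels are shuffled along with the rows, and reassigning the labels in ascending order makes the reordered events occupy the original slots.

```python
import pandas as pd

# illustrative subset: the sortable rows of one timestamp, keeping their original labels
s = pd.DataFrame(
    {'timestamp': [1, 1, 1], 'event': ['up', 'dog', 'fast']},
    index=[1, 4, 7],
)
rank = {'dog': 0, 'fast': 2, 'up': 3}

# reorder the rows by custom rank; the labels travel with their rows
s = s.loc[s['event'].map(rank).sort_values().index]
labels_after_sort = s.index.tolist()

# overwrite the labels with the same labels in ascending order, so the first
# sorted event now sits in the first original slot, and so on
s.index = sorted(s.index)
print(s)
```

With the labels restored, `combine_first` can align `s` with the original frame by index and take the remaining rows from there.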

Answer 2

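WenyoBen's answer got me thinking and supplied the missing piece of the puzzle. Here are two working solutions: one uses a categorical sort, the other a mapped sort.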

Solution 1: mapped sort (uses an additional sort column)

sorter_index = dict(zip(events_to_sort, range(len(events_to_sort))))

# get subset to sort
s = df.loc[df.event.isin(events_to_sort)].copy()

# make sort column
s['sort_col'] = s['event'].map(sorter_index)

# sort by timestamp first, then by the sort column
s = s.sort_values(['timestamp', 'sort_col'])

# reorder the index such that they will insert back into original df properly
s.index = sorted(s.index)

# remove the temporary sort_col
s.drop('sort_col', axis=1, inplace=True)

# place sorted events back into original df in the correct location
df = s.combine_first(df)

Solution 2: categorical sort

# get subset to sort
s = df.loc[df.event.isin(events_to_sort)].copy()

# convert event column to categorical type
s.event = s.event.astype('category')

# set category sort order
s['event'] = s['event'].cat.set_categories(events_to_sort)


# sort by timestamp first, then by event (in category order)
s = s.sort_values(['timestamp', 'event'])

# reorder the index such that they will insert back into original df properly
s.index = sorted(s.index)

# place sorted events back into original df in the correct location
df = s.combine_first(df)

Output of both:

    timestamp    event
0         0.0      wer
1         1.0      dog
2         1.0      def
3         1.0      abc
4         1.0     fast
5         1.0      prq
6         1.0      cde
7         1.0       up
8         1.0      bnm
9         2.0      ert
10        2.0      and
11        2.0      ert
12        2.0      ghj
13        2.0     down
14        2.0  streets
15        3.0      dog
16        3.0     runs
17        3.0      ert
18        3.0       up
19        3.0      dfg
20        3.0      prq
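The output above shows timestamp as floats (0.0, 1.0, ...), which suggests that `combine_first` upcast the column; this may depend on the pandas version. As a hedged alternative, the sketch below writes the sorted events back by position instead of using `combine_first`, so the other columns are left untouched. The function name `sort_selected_events` is hypothetical, and the sketch assumes the frame is already ordered by timestamp, as in the question.

```python
import pandas as pd


def sort_selected_events(df, events_to_sort):
    """Reorder only the listed events within each timestamp; other rows stay put.

    Assumes df is already ordered by timestamp (non-decreasing).
    """
    rank = {name: i for i, name in enumerate(events_to_sort)}

    # rows that take part in the sort
    mask = df['event'].isin(events_to_sort)
    subset = df.loc[mask, ['timestamp', 'event']]

    # sort the subset by timestamp, then by custom rank
    ordered = (
        subset.assign(sort_col=subset['event'].map(rank))
        .sort_values(['timestamp', 'sort_col'])
    )

    out = df.copy()
    # write the reordered names into the same row slots, by position
    out.loc[mask, 'event'] = ordered['event'].to_numpy()
    return out


# small illustrative frame, not the full data from the question
df = pd.DataFrame({
    'timestamp': [1, 1, 1, 1, 2, 2],
    'event': ['up', 'def', 'dog', 'fast', 'streets', 'down'],
})
events_to_sort = ['dog', 'runs', 'fast', 'up', 'and', 'down', 'streets']
result = sort_selected_events(df, events_to_sort)
print(result)
```

Because only the event column is assigned, the timestamp column keeps its integer values, and the input frame is not modified.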

Efficiently sorting selected rows within the same timestamp by a custom order

2 answers: