Sort a text-string column of a pandas DataFrame by the first letter of the string

Asked: 2018-03-26 00:08:38

Tags: pandas sorting dataframe text

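I have a pandas DataFrame that I need to sort by a column of text strings. I have tried three approaches. The first two are similar. The last one sorts, but it also produces a mysterious extra column.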

Here is the small test dataset:

raw_corpus  #test data


    unique_ID  count  trigger_channel_cat
0   11530      1      Photo and Video
1   17176      1      Environment Control and Monitoring
2   6984       1      Security and Monitoring Systems
3   15696      1      Photo and Video
4   16103      3      Finance and Payments
5   18534      5      News and Information
6   11677      331    Social Networks
7   702        1      Contacts
8   7251       1      Business Tools
9   10609      1      Photo and Video
10  1703       2      Blogging
11  20567      1      Social Networks
12  8357       1      Social Networks
13  4313       1      Fitness and Wearables
14  8552       1      Contacts
15  7634       1      News and Information
16  13698      1      Social Networks
17  13940      4      Business Tools
18  19784      3      Location
19  3561       1      Task Management and To-Dos

Using value_counts does not work:

raw_corpus_sorted=raw_corpus['trigger_channel_cat'].value_counts().index.tolist()
raw_corpus_sorted

['Social Networks',
 'Photo and Video',
 'Business Tools',
 'Contacts',
 'News and Information',
 'Fitness and Wearables',
 'Location',
 'Security and Monitoring Systems',
 'Task Management and To-Dos',
 'Environment Control and Monitoring',
 'Blogging',
 'Finance and Payments']

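A second attempt, using a different call to value_counts, gives the correct number of instances for each category but does not sort the categories alphabetically: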

raw_corpus_sorted=raw_corpus['trigger_channel_cat'].value_counts(sort=True) 
raw_corpus_sorted

Social Networks                       4
Photo and Video                       3
Business Tools                        2
Contacts                              2
News and Information                  2
Fitness and Wearables                 1
Location                              1
Security and Monitoring Systems       1
Task Management and To-Dos            1
Environment Control and Monitoring    1
Blogging                              1
Finance and Payments                  1
Name: trigger_channel_cat, dtype: int64

Using sort_values() sorts! But what is the first column?

#this one works - but what is that first column?
raw_corpus_sorted=raw_corpus['trigger_channel_cat'].sort_values()
raw_corpus_sorted

10                              Blogging
17                        Business Tools
8                         Business Tools
14                              Contacts
7                               Contacts
1     Environment Control and Monitoring
4                   Finance and Payments
13                 Fitness and Wearables
18                              Location
15                  News and Information
5                   News and Information
0                        Photo and Video
9                        Photo and Video
3                        Photo and Video
2        Security and Monitoring Systems
11                       Social Networks
6                        Social Networks
16                       Social Networks
12                       Social Networks
19            Task Management and To-Dos
Name: trigger_channel_cat, dtype: object

1 Answer:

Answer 0 (score: 1)

When you call sort_values on the DataFrame, you need to add () at the end and pass the target column to sort by:
raw_corpus_sorted=raw_corpus.sort_values('trigger_channel_cat')
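
As a minimal sketch of the call above, assuming a small hand-built frame (the name `sample` and the choice of five rows from the question's data are for illustration only): the unlabeled first column in the question's output is the index. sort_values keeps each row's original index label, and reset_index(drop=True) is one way to renumber the rows if those labels are not wanted.

```python
import pandas as pd

# Five rows copied from the question's test data; `sample` is an illustrative name.
sample = pd.DataFrame({
    "unique_ID": [11530, 17176, 6984, 7251, 1703],
    "count": [1, 1, 1, 1, 2],
    "trigger_channel_cat": [
        "Photo and Video",
        "Environment Control and Monitoring",
        "Security and Monitoring Systems",
        "Business Tools",
        "Blogging",
    ],
})

# Sort the whole frame by the text column; each row keeps its original index label.
by_category = sample.sort_values("trigger_channel_cat")

# Optionally discard the old labels and renumber the rows from 0.
renumbered = by_category.reset_index(drop=True)

print(by_category)
print(renumbered)
```

Passing the column name to the DataFrame's sort_values, rather than selecting the column first, keeps unique_ID and count attached to each sorted row.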

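With the data you added (some category strings are cut off in the output below, e.g. row 7 shows 'acts' instead of 'Contacts'; this appears to come from how the sample was loaded, not from the sort):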

df.sort_values(' trigger_channel_cat')
Out[1086]: 
    unique_ID  count      trigger_channel_cat
10       1703      2                 Blogging
17      13940      4           Business Tools
8        7251      1           Business Tools
14       8552      1                 Contacts
1       17176      1  Environment Control and
4       16103      3     Finance and Payments
13       4313      1    Fitness and Wearables
18      19784      3                 Location
15       7634      1     News and Information
5       18534      5     News and Information
0       11530      1          Photo and Video
9       10609      1          Photo and Video
3       15696      1          Photo and Video
2        6984      1  Security and Monitoring
12       8357      1          Social Networks
6       11677    331          Social Networks
16      13698      1          Social Networks
11      20567      1          Social Networks
19       3561      1  Task Management and To-
7         702      1                     acts
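
The call above orders rows by the full string. If only the first letter should decide the order, as the question title suggests, one possible sketch uses the `key` argument of sort_values (available in pandas 1.1 and later) together with a stable sort, so rows sharing a first letter stay in their original order. The frame `sample` and its rows are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Illustrative rows taken from the question's categories; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "trigger_channel_cat": [
        "Social Networks",
        "Photo and Video",
        "Security and Monitoring Systems",
        "Blogging",
        "Business Tools",
    ],
})

# Compare only the first character; mergesort is stable, so ties keep input order.
by_first_letter = sample.sort_values(
    "trigger_channel_cat",
    key=lambda col: col.str[0],
    kind="mergesort",
)

# For comparison: ordering by the complete string.
by_full_string = sample.sort_values("trigger_channel_cat")

print(by_first_letter)
print(by_full_string)
```

In this sketch the two "S" rows keep their input order under the first-letter sort, whereas the full-string sort places "Security and Monitoring Systems" before "Social Networks".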

For value_counts, you can follow it with sort_index to order the result by category name:

df['trigger_channel_cat'].value_counts(sort=True).sort_index()
Out[1088]: 
Blogging                   1
Business Tools             2
Contacts                   1
Environment Control and    1
Finance and Payments       1
Fitness and Wearables      1
Location                   1
News and Information       2
Photo and Video            3
Security and Monitoring    1
Social Networks            4
Task Management and To-    1
acts                       1
Name:  trigger_channel_cat, dtype: int64
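
A short self-contained sketch of the same idea, assuming a hand-built Series (the name `categories` and its seven values are illustrative): value_counts orders the result by frequency, and chaining sort_index reorders the same counts by category label.

```python
import pandas as pd

# Illustrative category values; `categories` is a hypothetical name.
categories = pd.Series(
    [
        "Social Networks",
        "Photo and Video",
        "Blogging",
        "Social Networks",
        "Contacts",
        "Photo and Video",
        "Social Networks",
    ],
    name="trigger_channel_cat",
)

# Counts ordered by frequency, most common first.
counts = categories.value_counts()

# The same counts, reordered alphabetically by category label.
alphabetical = counts.sort_index()

print(counts)
print(alphabetical)
```

If only the sorted category names are needed, `alphabetical.index.tolist()` returns them as a plain list.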