Sort an entire CSV by frequency of occurrence in one column

Asked: 2015-06-11 17:20:33

Tags: python sorting csv pandas frequency

I have a large CSV file, which is a log of caller data.

A short snippet of my file:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the entire list by how frequently each customer occurs, so that it looks like this:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I have tried groupby, but that only prints the company name and its frequency, not the other columns. I also tried

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

But these give me an error: ValueError: Wrong number of items passed 1, indices imply 24

I have looked at something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)

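But this only prints two columns, and I want to sort my entire CSV. My output should be my entire CSV, sorted by the first column.

Thanks in advance for the help!

4 Answers:

Answer 0: (score: 5)

This seems to do what you want: add a count column by performing a groupby followed by a transform with value_counts, and then sort on that column: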

In [22]:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1

You can remove the extraneous column with df.drop:

In [24]:
df.drop('count', axis=1)

Out[24]:
  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
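
To see why a plain groupby returned only company names and frequencies while transform keeps every row, a minimal sketch may help. It rebuilds just the CompanyName column of the sample by hand, the variable names per_group and per_row are illustrative, and it uses the string aggregation 'count' rather than pd.Series.value_counts:

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
})

# Aggregating collapses the data to one entry per company ...
per_group = df.groupby("CompanyName")["CompanyName"].count()

# ... whereas transform broadcasts each group's count back onto its rows,
# so the result has the same length and index as the original frame.
per_row = df.groupby("CompanyName")["CompanyName"].transform("count")

print(len(per_group), len(per_row))
```

Because per_row is aligned with df, it can be assigned as a new column directly, which is what makes the sort on that column possible.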

Answer 1: (score: 5)

2021 update

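The answers proposed by EdChum and Ilya K. no longer work.

The function pd.Series.value_counts returns a Series containing the counts of unique values. Here, however, it is applied through groupby, after the CompanyName Series has already been split into groups of unique values. So the output of the function for a single group looks like this:

    Customer3    4
    dtype: int64

This makes no sense: transform cannot turn each value in a group into a whole Series. What we need is just the integer 4, not the whole Series.

We can still make use of the preceding groupby, though: count the number of values in each group, replace every row of the group with that number, and put the results together into the final frequency Series.

We can replace pd.Series.value_counts with pd.Series.count, or simply use the function name 'count':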

import pandas as pd

df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})

df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
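
As a variation on the snippet above, the frequency does not have to be stored as a column. The sketch below wraps the idea in a helper; the name sort_by_frequency is hypothetical, it assumes the frame has a unique index, and it sorts the negated counts with a stable algorithm so that tied rows keep their original relative order:

```python
import pandas as pd


def sort_by_frequency(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return df with rows ordered by how often each value of `column` occurs."""
    freq = df.groupby(column)[column].transform("count")
    # Sorting the negated counts in ascending order puts the most frequent
    # values first; mergesort is stable, so ties keep their original order.
    order = (-freq).sort_values(kind="mergesort").index
    return df.loc[order]


df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

result = sort_by_frequency(df, "CompanyName")
print(result)
```

The original frame is left untouched, so no helper column has to be dropped afterwards.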

Answer 2: (score: 3)

The top-voted answer needs a small addition: sort has been deprecated in favor of sort_values and sort_index.

sort_values works like this:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
    # 'count' is used here instead of pd.Series.value_counts, which,
    # per the 2021 update above, no longer works inside transform.
    df['count'] = df.groupby('a')['a'].transform('count')
    df.sort_values('count', inplace=True, ascending=False)
    print('df sorted: \n{}'.format(df))

df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
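
Since this answer names both replacements for sort, here is a small sketch of how they differ, reusing the same toy frame. The variable names by_value and by_label are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 2, 3]})

# sort_values orders rows by the contents of one or more columns.
by_value = df.sort_values("b", ascending=False)

# sort_index orders rows by their index labels, which here restores
# the original row order.
by_label = by_value.sort_index()

print(by_value.index.tolist())
print(by_label.index.tolist())
```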

Answer 3: (score: 0)

I think there must be a better way to do this, but this should work:

Preparing the data:

import io
import pandas as pd

data = """
CompanyName  HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User
"""
# r"\s+" splits on runs of whitespace; r"\s*" can also match the empty string.
df = pd.read_table(io.StringIO(data), sep=r"\s+")

Doing the transformation:

# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())

# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")

# output the original data frame in the order of the new index.
df.reindex(new_index.index)

Output:

  CompanyName HighPriority QualityIssue
3   Customer3           No    Equipment
5   Customer3           No         User
6   Customer3          Yes         User
7   Customer3          Yes    Equipment
0   Customer1          Yes         User
1   Customer1          Yes         User
4   Customer1           No      Neither
8   Customer4           No         User
2   Customer2           No         User

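It may not be intuitive what happens here, but at the moment I cannot think of a better way to do it. I tried to comment as much as possible.

The tricky part is that the index of count_df holds the (unique) customer names. I therefore join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").

The magic here is that count_df is already sorted by the number of occurrences, which is why no explicit sort is needed. All we have to do is reorder the rows of the original data frame to follow the rows of the joined data frame, and we get the expected result.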
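
The same idea, relying on value_counts already being sorted by count, can also be sketched without a merge. This is an alternative formulation rather than the code above, and the names counts and reordered are illustrative. It walks the customers in value_counts order and stacks their rows:

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# value_counts sorts by count in descending order by default, so its index
# lists the customers from most to least frequent.
counts = df["CompanyName"].value_counts()

# Stack the rows of each customer in that order; rows within a customer
# keep their original relative order.
reordered = pd.concat([df[df["CompanyName"] == name] for name in counts.index])

print(reordered)
```

This scans the frame once per distinct customer, so for a large log with many customers the transform-based approach from the other answers is likely the better fit. The relative order of customers with equal counts (Customer2 and Customer4 here) should not be relied upon.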