I have a large CSV file that is a log of caller data.
A short snippet of my file:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the entire list by how often each customer occurs, so that it looks like this:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I have tried groupby, but that only prints the company name and the frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
but these give me the error: ValueError: Wrong number of items passed 1, indices imply 24
I have also seen something like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
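but this only prints two columns, and I want to sort my entire CSV. My output should be my entire CSV sorted by the first column.
Thanks in advance for your help!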
Answer 0 (score: 5)
This seems to do what you want: it performs a groupby and a transform with value_counts to add a count column, which you can then sort on:
In [22]:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
You can remove the extra column with df.drop:
In [24]:
df.drop('count', axis=1)
Out[24]:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
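On newer pandas versions, where DataFrame.sort is no longer available, the same idea can be sketched without keeping a helper column. This is only a minimal sketch, not the answerer's code: the names freq, order and sorted_df are made up for illustration, and the sample frame is rebuilt from the snippet in the question. It negates the counts and uses a stable mergesort, so rows with the same frequency keep their original relative order.

```python
import pandas as pd

# Sample data rebuilt from the question's snippet.
df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# Frequency of each row's CompanyName, aligned with df's index.
freq = df["CompanyName"].map(df["CompanyName"].value_counts())

# Sorting the negated counts in ascending order puts the most frequent
# customers first; mergesort is stable, so ties keep their original order.
order = (-freq).sort_values(kind="mergesort").index

# Reorder the original rows; no extra column is added to df.
sorted_df = df.loc[order]
print(sorted_df)
```

Because the sort is stable, Customer2 stays ahead of Customer4 here, which matches the order requested in the question.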
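Answer 1 (score: 5)
The function pd.Series.value_counts returns a Series containing the counts of unique values. Here, however, it is applied after a groupby on the DataFrame, so the CompanyName Series has already been split into groups that each hold a single unique value. For one group, the function therefore returns something like this:
Customer3    4
dtype: int64
That makes no sense as a transform result: a single value in the Series cannot be turned into a whole Series. All we need is the integer 4, not the whole Series.
We can still take advantage of the preceding groupby, though: count the number of values in each group, transform each group into that number, and put the results together into the final Frequency Series.
To do that, replace pd.Series.value_counts with pd.Series.count, or simply pass the function name 'count':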
import pandas as pd
df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})
df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
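To make the difference between the two functions concrete, here is a small sketch. The variable names (names, one_group, per_group, per_row) are illustrative and do not come from the answer. It shows that value_counts on a single group yields a one-element Series, count yields one number per group, and transform('count') broadcasts that number back to every row.

```python
import pandas as pd

names = pd.Series(
    ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
     "Customer3", "Customer3", "Customer3", "Customer4"],
    name="CompanyName",
)

# Group the Series by its own values, one group per customer.
grouped = names.groupby(names)

# value_counts on one group returns a Series, not a scalar.
one_group = grouped.get_group("Customer3").value_counts()

# count reduces each group to a single integer.
per_group = grouped.count()

# transform('count') repeats that integer for every row of the group,
# so the result has the same length and index as the input.
per_row = grouped.transform("count")

print(one_group)
print(per_group)
print(per_row)
```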
Answer 2 (score: 3)
The top-voted answer needs a small addition: sort has been deprecated in favor of sort_values and sort_index.
sort_values works like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = \
df.groupby('a')['a']\
.transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
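As a rough illustration of how the two replacements relate, the sketch below sorts by value and then uses sort_index to get back to the original row order. It is an assumption-laden sketch rather than part of the answer: transform("size") is swapped in as the counting step, and the names by_count and restored are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 2, 3]})

# Counting step (a substitution): number of rows in each group of 'a'.
df["count"] = df.groupby("a")["a"].transform("size")

# sort_values orders rows by column values; here count descending,
# then b ascending to break ties. It returns a new frame by default.
by_count = df.sort_values(["count", "b"], ascending=[False, True])

# sort_index orders rows by their index labels, which restores the
# original row order after the value sort.
restored = by_count.sort_index()

print(by_count)
print(restored)
```

Returning a new frame instead of passing inplace=True keeps the unsorted df available for comparison.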
Answer 3 (score: 0)
I think there must be a better way to do this, but this should work:
Prepare the data:
import io
import pandas as pd

data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
# split on runs of whitespace (r"\s*" can also match the empty string)
df = pd.read_table(io.StringIO(data), sep=r"\s+")
Do the transformation:
# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())
# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")
# output the original data frame in the order of the new index.
df.reindex(new_index.index)
Output:
CompanyName HighPriority QualityIssue
3 Customer3 No Equipment
5 Customer3 No User
6 Customer3 Yes User
7 Customer3 Yes Equipment
0 Customer1 Yes User
1 Customer1 Yes User
4 Customer1 No Neither
8 Customer4 No User
2 Customer2 No User
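It may not be obvious what is happening here, but at the moment I cannot think of a better way to do it. I have tried to comment as much as possible.
The tricky part is that the index of count_df holds the (unique) customer names. That is why I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic here is that count_df is already sorted by number of occurrences, which is why no explicit sort is needed. All that is left is to reorder the rows of the original data frame by the rows of the joined data frame, and we get the expected result.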
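The point that value_counts already returns the customers in descending order of frequency can also be used without a merge. The following is a hedged sketch of that idea, not the answerer's method: it turns CompanyName into an ordered categorical whose category order is the value_counts order, then reorders the rows by the category codes with a stable sort. The names counts, as_cat, positions and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# value_counts sorts by count in descending order by default, so its index
# lists the customers from most to least frequent.
counts = df["CompanyName"].value_counts()

# Category codes follow that order: 0 for the most frequent customer.
as_cat = pd.Categorical(df["CompanyName"], categories=counts.index, ordered=True)

# Stable argsort of the codes keeps each customer's rows in original order.
positions = as_cat.codes.argsort(kind="stable")

result = df.iloc[positions]
print(result)
```

The relative order of customers with the same count (Customer2 and Customer4 here) follows whatever order value_counts returns for ties, so it should not be relied on.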