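I'm trying to get a CSV in a specific format so that other code can read it correctly. I first ordered the data using OrderedDicts, but that took longer and my plotting code gave me a "StringIO() takes no keyword arguments" error. While I think I could fix that, I prefer my value_counts approach because it's faster. I already get a CSV file containing the right information; the only step I still need is the formatting. I've looked through several threads on similar questions, but none of them explain how to sort the data in this particular way.
My code: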
import csv
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
data = pd.read_csv('MYDATA.csv', index_col=0, parse_dates=True)
data[['QualityIssue','CompanyName']]
data['QualityIssue'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()
TopCustomerCallers = RatedCustomerCallers[0:18]
print(TopCustomerCallers)
TopCustomerCallers.to_csv('topcustomercallerslist.csv')
byqualityissue = data.groupby(["CompanyName","QualityIssue"]).size()
print(byqualityissue)
byqualityissue.to_csv('byqualityissue.csv', header=True)
Output:
CompanyName, QualityIssue, 0
Company 1, Equipment Error, 15
Company 2, User Error, 1
Company 2, Equipment Error, 5
Company 3, Equipment Error, 3
Company 3, User Error, 10
Company 3, Neither, 13
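The company name is repeated for each type of issue.
However, I'd like it sorted by top customers (summing the Equipment, User, and Neither call counts) and displayed like this: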
Top Calling Customers, Equipment, User, Neither,
Company 3, 3, 10, 13,
Company 1, 15, 0, 0,
Company 2, 5, 1, 0,
I tried using a pivot table:
df = pd.DataFrame(byqualityissue)
df.pivot(index='CompanyName', columns='QualityIssue', values='0')
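But it gives me KeyError: '0', which is odd, because that is exactly what I passed as the values argument. I'm also not sure it would work, because the output for each customer only lists the types of calls that customer actually made. For example, Company 1 only has Equipment Error calls, so no User Error or Neither rows are listed for it. I'm not sure whether a pivot table would account for that.
Answer 0 (score: 1)
Read your CSV file. Index it by company and quality issue, then unstack it on quality issue. Finally, replace the NaN values that appear wherever no matching data was found.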
In [341]: d1
Out[341]:
  Company Name Quality Issue  Cases
0         Co 1     Equipment     15
1         Co 2          User      1
2         Co 2     Equipment      5
3         Co 3     Equipment      3
4         Co 3          User     10
5         Co 3       Neither     13
In [342]: d2 = d1.set_index(["Company Name", "Quality Issue"])
In [343]: d2
Out[343]:
                            Cases
Company Name Quality Issue
Co 1         Equipment         15
Co 2         User               1
             Equipment          5
Co 3         Equipment          3
             User              10
             Neither           13
In [344]: d3 = d2.unstack("Quality Issue")
In [345]: d3.fillna(0)
Out[345]:
                  Cases
Quality Issue Equipment Neither User
Company Name
Co 1                 15       0    0
Co 2                  5       0    1
Co 3                  3      13   10
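For anyone who wants to run the steps above outside an interactive session, here is a minimal sketch. It rebuilds the d1 frame by hand from the values shown in the session; the names cases and wide are illustrative and not part of the original answer. It also assumes you are happy to select the single Cases column before unstacking, which gives flat column labels, and to pass fill_value=0 to unstack instead of calling fillna(0) afterwards.

```python
import pandas as pd

# Sample rows copied from the d1 frame shown in the session above.
d1 = pd.DataFrame({
    "Company Name": ["Co 1", "Co 2", "Co 2", "Co 3", "Co 3", "Co 3"],
    "Quality Issue": ["Equipment", "User", "Equipment",
                      "Equipment", "User", "Neither"],
    "Cases": [15, 1, 5, 3, 10, 13],
})

# Index by company and issue type, then pick the single "Cases" column so that
# unstack() produces flat column labels rather than a ("Cases", ...) MultiIndex.
cases = d1.set_index(["Company Name", "Quality Issue"])["Cases"]

# fill_value=0 fills the company/issue pairs that have no data while
# unstacking, so the counts stay integers instead of turning into floats.
wide = cases.unstack("Quality Issue", fill_value=0)

print(wide)
```

The resulting frame has one row per company and one column per issue type, so it can be written out with to_csv directly.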
Answer 1 (score: 1)
In the spirit of StackOverflow, here is how I solved the problem.
import pandas as pd

# Load the call log; like the old DataFrame.from_csv, use the first column as the index.
data = pd.read_csv('MYDATA.csv', index_col=0, parse_dates=True)
# Count calls per (company, issue type), then pivot the issue types into columns.
byqualityissue = data.groupby(["CompanyName", "QualityIssue"]).size()
formatted = byqualityissue.unstack(level=-1).fillna(0)
formatted.to_csv('byqualityissue.csv', header=True)
# Add a per-company total and sort so the top-calling customers come first.
includingtotals = formatted.assign(Total=formatted.sum(axis=1))
sorted_totals = includingtotals.sort_values(by='Total', ascending=False)
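I used unstack to reorganize my data, replaced the NaN values with 0, summed across each row and added a new column holding those totals, and then sorted by that column.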
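Since MYDATA.csv isn't available here, the following is a rough, self-contained sketch of the same unstack / total / sort sequence. The calls list is a hypothetical stand-in for the real file, built so that its counts match the output shown in the question. The names counts and top_callers, and the use of index_label to produce the "Top Calling Customers" header, are my own assumptions rather than part of the original solution.

```python
import pandas as pd

# Hypothetical raw call log standing in for MYDATA.csv: one row per call.
calls = (
    [("Company 1", "Equipment Error")] * 15
    + [("Company 2", "User Error")] * 1
    + [("Company 2", "Equipment Error")] * 5
    + [("Company 3", "Equipment Error")] * 3
    + [("Company 3", "User Error")] * 10
    + [("Company 3", "Neither")] * 13
)
data = pd.DataFrame(calls, columns=["CompanyName", "QualityIssue"])

# Count calls per (company, issue) pair and pivot the issue types into columns.
# fill_value=0 covers companies that never reported a given issue type.
counts = data.groupby(["CompanyName", "QualityIssue"]).size()
formatted = counts.unstack(level=-1, fill_value=0)

# Add a per-company total and put the busiest callers first.
formatted["Total"] = formatted.sum(axis=1)
top_callers = formatted.sort_values(by="Total", ascending=False)

# Label the index column the way the desired output in the question does.
print(top_callers.to_csv(index_label="Top Calling Customers"))
```

With this sample data the rows come out as Company 3, Company 1, Company 2. If the Total column shouldn't appear in the file, top_callers.drop(columns="Total") can be written out instead.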