I currently have the following:
Business Name    Violation    Business License #
Place 1          Crime 1      111
Place 1          Crime 2      222
Place 2          Crime 3      333
Place 3          Crime 4      444
Place 3          Crime 5      444
I am trying to get the following:
Business Name    Violations    Business License #'s
Place 1          2             2
Place 2          1             1
Place 3          2             1
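Essentially, I just need the number of distinct values in two different columns, grouped by business name. This is what I have so far, and I know it is wrong: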
df.groupby(['Business Name','Business License #']).size()
Any help would be greatly appreciated!
Answer 0 (score: 2)
Use groupby followed by nunique (pandas DataFrameGroupBy.nunique):
df.groupby('Business Name')[['Violation','Business License #']].nunique()
               Violation  Business License #
Business Name
Place 1                2                   2
Place 2                1                   1
Place 3                2                   1
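For anyone who wants to reproduce this, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question, and the variable names `counts` and `row_counts` are only illustrative. It also shows why a plain row count is not enough: Place 3 has two rows but only one distinct license number.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Business Name": ["Place 1", "Place 1", "Place 2", "Place 3", "Place 3"],
    "Violation": ["Crime 1", "Crime 2", "Crime 3", "Crime 4", "Crime 5"],
    "Business License #": [111, 222, 333, 444, 444],
})

cols = ["Violation", "Business License #"]

# Number of distinct values per business, one column at a time.
counts = df.groupby("Business Name")[cols].nunique()

# For comparison: count() tallies non-null rows, so duplicates are counted twice.
row_counts = df.groupby("Business Name")[cols].count()

print(counts)
print(row_counts)
```

With this data, count() reports 2 licenses for Place 3, while nunique() reports 1, which is what the question asks for.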
Answer 1 (score: 1)
Chris is right, nunique does the job, but you will need to reset the index afterwards (for example, by chaining .reset_index() onto the result) so that Business Name becomes a regular column again instead of the index.
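As a rough illustration of the index reset, the sketch below rebuilds the sample data from the question and chains reset_index() onto the nunique() result. The variable name `result` is just a placeholder, and this is one way to do it rather than necessarily the exact code the answer had in mind.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Business Name": ["Place 1", "Place 1", "Place 2", "Place 3", "Place 3"],
    "Violation": ["Crime 1", "Crime 2", "Crime 3", "Crime 4", "Crime 5"],
    "Business License #": [111, 222, 333, 444, 444],
})

# nunique() leaves "Business Name" as the index; reset_index() moves it
# back into an ordinary column and restores a default integer index.
result = (
    df.groupby("Business Name")[["Violation", "Business License #"]]
    .nunique()
    .reset_index()
)

print(result)
```

After the reset, the frame has three ordinary columns (Business Name, Violation, Business License #), which matches the layout requested in the question.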