I have a DataFrame like the one shown in the image:
import pandas as pd

df = pd.DataFrame({'bus': [{268}, {23, 200, 268}, {24}, {24}, {200, 268}],
                   'problem': ["Driver Issues", "Driver Issues", "Timing Problem",
                               "Routing", "Timing Problem"]})
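The bus column holds bus numbers, and the problem column holds the complaint made about those buses. Any row in the bus column can contain one or more bus numbers.
I am trying to count the complaints for each bus number, that is, to find the most frequently reported bus numbers and the most common complaint for each of them.
However, because the values are sets, I can't use the Counter function.
The output could look like: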
df2 = pd.DataFrame({'busses':["268","24","200","23"],
'ComplainFrequency':["3" ,"2" , "2","1"]})
and
Bus no: 268
Coplains:
Driver Issues:2
Timing Problem:1
....
Answer 0 (score: 2)
First, flatten the sets into a new DataFrame with one row per (bus, problem) pair:
df1 = pd.DataFrame([(c, b) for a, b in zip(df['bus'], df['problem']) for c in a],
columns=['bus','problem'])
print (df1)
bus problem
0 268 Driver Issues
1 200 Driver Issues
2 268 Driver Issues
3 23 Driver Issues
4 24 Timing Problem
5 24 Routing
6 200 Timing Problem
7 268 Timing Problem
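As an alternative to the list comprehension, the same flattening can probably be done with DataFrame.explode, which is available from pandas 0.25 onward. The sketch below is not part of the original answer; the name df1_alt is made up for illustration, and each set is converted to a sorted list first so that the row order is deterministic.

```python
import pandas as pd

df = pd.DataFrame({'bus': [{268}, {23, 200, 268}, {24}, {24}, {200, 268}],
                   'problem': ["Driver Issues", "Driver Issues", "Timing Problem",
                               "Routing", "Timing Problem"]})

# Turn each set into a sorted list, then emit one row per list element.
# df1_alt is a hypothetical name used only in this sketch.
df1_alt = (df.assign(bus=df['bus'].apply(sorted))
             .explode('bus')
             .reset_index(drop=True))
print(df1_alt)
```

Note that explode leaves the bus column with object dtype; if integer buses are needed, a cast such as df1_alt['bus'].astype(int) can follow.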
If the sets contain string values with comma-separated bus numbers, you need to flatten twice:
df = pd.DataFrame({'bus':[{'268'},{'23,200,268'},{'24'},{'24'},{'200,268'}],
'problem':["Driver Issues" ,"Driver Issues" , "Timing Problem",
"Routing","Timing Problem"]})
print (df)
bus problem
0 {268} Driver Issues
1 {23,200,268} Driver Issues
2 {24} Timing Problem
3 {24} Routing
4 {200,268} Timing Problem
df1 = pd.DataFrame([(d, b) for a, b in zip(df['bus'], df['problem'])
for c in a
for d in c.split(',')],
columns=['bus','problem'])
print (df1)
bus problem
0 268 Driver Issues
1 23 Driver Issues
2 200 Driver Issues
3 268 Driver Issues
4 24 Timing Problem
5 24 Routing
6 200 Timing Problem
7 268 Timing Problem
Then use GroupBy.size, first per bus and then per bus and problem:
df2 = df1.groupby('bus')['problem'].size().reset_index(name='ComplainFrequency')
print (df2)
bus ComplainFrequency
0 200 2
1 23 1
2 24 2
3 268 3
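The desired output in the question lists the buses from most to least complained about, while GroupBy.size returns them ordered by bus. A minimal sketch of the extra sorting step follows, assuming df1 is the flattened frame from above (rebuilt here so the snippet runs on its own); df2_sorted is a name introduced only for this example.

```python
import pandas as pd

# Flattened (bus, problem) pairs, equivalent to df1 above.
df1 = pd.DataFrame([(268, "Driver Issues"), (23, "Driver Issues"),
                    (200, "Driver Issues"), (268, "Driver Issues"),
                    (24, "Timing Problem"), (24, "Routing"),
                    (200, "Timing Problem"), (268, "Timing Problem")],
                   columns=['bus', 'problem'])

df2 = df1.groupby('bus')['problem'].size().reset_index(name='ComplainFrequency')

# Order the buses from most to least frequent complaint count.
df2_sorted = (df2.sort_values('ComplainFrequency', ascending=False)
                 .reset_index(drop=True))
print(df2_sorted)
```

The relative order of buses with the same count (here 24 and 200) should not be relied on.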
df3 = df1.groupby(['bus','problem']).size().reset_index(name='Coplains')
print (df3)
bus problem Coplains
0 200 Driver Issues 1
1 200 Timing Problem 1
2 23 Driver Issues 1
3 24 Routing 1
4 24 Timing Problem 1
5 268 Driver Issues 2
6 268 Timing Problem 1
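To get from these counts to the most common complaint per bus, and to the per-bus report shown in the question, one possible approach is to sort the counts and keep the first row for each bus. This is a sketch rather than part of the original answer; the name top is an assumption, and when two problems tie for a bus (as for bus 200) the one that is kept is arbitrary.

```python
import pandas as pd

# Flattened (bus, problem) pairs, equivalent to df1 above.
df1 = pd.DataFrame([(268, "Driver Issues"), (23, "Driver Issues"),
                    (200, "Driver Issues"), (268, "Driver Issues"),
                    (24, "Timing Problem"), (24, "Routing"),
                    (200, "Timing Problem"), (268, "Timing Problem")],
                   columns=['bus', 'problem'])

df3 = df1.groupby(['bus', 'problem']).size().reset_index(name='Coplains')

# Highest count first within each bus, then keep one row per bus.
top = (df3.sort_values(['bus', 'Coplains'], ascending=[True, False])
          .drop_duplicates('bus')
          .reset_index(drop=True))
print(top)

# Print a report in the layout requested in the question.
for bus, grp in df3.groupby('bus'):
    print(f"Bus no: {bus}")
    print("Coplains:")
    for problem, n in zip(grp['problem'], grp['Coplains']):
        print(f"{problem}:{n}")
```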