I'm in a bit of a pickle.
I have a dataframe:
Old_DF
Date. Year On/Off Gender. Status.
0 2019-03-14 09:59:30 Senior Off Campus Male Full Time
1 2019-03-13 15:56:13 Senior Off Campus Male Full Time
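One column of this dataframe asks people to rank certain things, but in the infinite wisdom of Jotform's export format, each person's ranking is put into a single string in one cell, like so: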
0 2019-03-14 09:59:30 Senior Off Campus Male Full Time 1Food\r 2Lounge or Study Space\r 3Retail\r 4Ev... NaN
1 2019-03-13 15:56:13 Senior Off Campus Male Full Time 1Lounge or Study Space\r 2Food\r 3Academic Res... NaN
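My idea is essentially to split the string into keywords and assign each one a letter value, i.e. "Food" = "A", "Lounge or Study Space" = "B".
From there, I want to convert each string into whichever ordering of "ABCDEFG" it represents, add that as a new column containing only the letter combination, and then count which combination occurs most often.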
'Combo'
0 'ABCDEFG'
1 'BDCFGAE'
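The letter encoding described above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper name `encode_ranking` is made up, the letters for the items after "B" follow the order of the list in the question, items are matched on the leading words of each label (the full labels carry parenthetical details), and the sample string is a shortened, invented ranking rather than a row from the data.

```python
import re

# Letter codes: A and B come from the question, the rest are assumed
# to continue in the same order.
LETTERS = {
    "food": "A",
    "lounge or study space": "B",
    "retail": "C",
    "event space": "D",
    "academic resources": "E",
    "student life": "F",
    "general services": "G",
}

def encode_ranking(raw):
    # Convert one exported ranking cell into its letter combination.
    combo = []
    # splitlines() breaks on carriage returns as well as newlines.
    for part in raw.splitlines():
        # Drop the leading rank digits and surrounding whitespace.
        name = re.sub(r"^\s*\d+", "", part).strip().lower()
        for prefix, letter in LETTERS.items():
            if name.startswith(prefix):
                combo.append(letter)
                break
    return "".join(combo)

# Invented, shortened sample (not taken from the real export).
sample = (
    "1Lounge or Study Space\r 2Food\r "
    "3Academic Resources (Tutoring, Career Advice)\r 4Retail"
)
print(encode_ranking(sample))  # BAEC
```

With pandas, a function like this could presumably be applied to the ranking column with `Series.apply` to build the 'Combo' column.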
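My concern is that, mathematically, this is either a lot of possible combinations (seven items can be ordered in 7! = 5040 ways) or only one.
Here is what I have written so far: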
clean_3 =
rank
0 food lounge or study space retail event space ...
1 lounge or study space food academic resources ...
Combo_list = []
small_combo_list = []
for i in clean_3:
    if clean_3[i] == 'food':
        Combo_list.append('A')
    elif clean_3[i] == 'lounge or study space':
        Combo_list.append('B')
    elif clean_3[i] == 'retail':
        Combo_list.append('C')
    elif clean_3[i] == 'event space':
        Combo_list.append('D')
    elif clean_3[i] == 'academic resources':
        Combo_list.append('E')
    elif clean_3[i] == 'student life':
        Combo_list.append('F')
    elif clean_3[i] == 'general services':
        Combo_list.append('G')
    small_combo_list.append(Combo_list)
print(small_combo_list)
But I get this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
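which (to me at least) makes no sense, because it's a dataframe, not a Series.
Ideally, if there is a more efficient way to do this given the size of this csv, blow my mind. Let me know if any further clarification is needed!
Edit: here are just two rows of the current dataframe, which show how clunky Jotform's export format is: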
Date. Year On/Off Gender. Status. Rank
0 2019-03-14 09:59:30 Senior Off Campus Male Full Time 1Food
2Lounge or Study Space
3Retail
4Event Space
5Academic Resources (Tutoring, Career Advice)
6Student Life (Student Involvement, Diversity Services)
7General Services (Lockers, Information Desk, Vending Machines)
Date. Year On/Off Gender. Status. Rank
1 2019-03-14 09:59:30 Senior Off Campus Male Full Time 1Food
2Lounge or Study Space
3Retail
4Event Space
5Academic Resources (Tutoring, Career Advice)
6Student Life (Student Involvement, Diversity Services)
7General Services (Lockers, Information Desk, Vending Machines)
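Answer 0 (score: 1)
It would be better if I had more sample data, since it is hard to test with only two rows, but you can try this.
First, clean the data with .str.replace and .str.split.
After that, I convert the result to str, so the column holds plain strings (object dtype) rather than lists, which cannot be used as group keys.
Now we have all the choices cleaned up and in order.
So we can simply groupby and count, like this: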
# Dataframe I worked with
Date Year On/Off Gender Status \
0 2019-03-14 09:59:30 Senior Off Campus Male Full Time
1 2019-03-13 15:56:13 Senior Off Campus Male Full Time
Ranking
0 1Food\r 2Lounge or Study Space\r 3Retail\r 4Ev...
1 1Lounge or Study Space\r 2Food\r 3Academic Res...
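# Clean up Ranking column: strip the rank digits, split on '\r', store as str
# (regex=True is required for the pattern to be treated as a regular expression)
df['Ranking'] = df['Ranking'].str.replace(r'\d+', '', regex=True).str.split('\r').astype(str)
# Count how many times each ranking was chosen and store it as a column
df['times_chosen'] = df.groupby('Ranking').Ranking.transform('size')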
Output
Ranking times_chosen
0 ['Food', ' Lounge or Study Space', ' Retail', ... 1
1 ['Lounge or Study Space', ' Food', ' Academic ... 1
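Note that every item after the first keeps a leading space in the output above, because the split happens on the carriage return and the space that follows it stays attached. A minimal plain-Python sketch of a variant that also strips that whitespace is given here. The helper name `clean_ranking` is hypothetical, and returning a tuple instead of a stringified list is a swapped-in choice, made because tuples are hashable and can therefore act as group keys directly.

```python
import re

def clean_ranking(raw):
    # Return the ranked item names as a tuple, in rank order.
    items = []
    for part in raw.splitlines():
        # Remove the leading rank digits, then any stray whitespace.
        name = re.sub(r"^\s*\d+", "", part).strip()
        if name:
            items.append(name)
    # A tuple is hashable, so it can serve as a dict key or a group key.
    return tuple(items)

print(clean_ranking("1Food\r 2Lounge or Study Space\r 3Retail"))
# ('Food', 'Lounge or Study Space', 'Retail')
```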
Second option
Don't convert the counts to a column; just group by:
df.groupby('Ranking').Ranking.size()
Ranking
['Food', ' Lounge or Study Space', ' Retail', ' Ev...'] 1
['Lounge or Study Space', ' Food', ' Academic Res...'] 1
Name: Ranking, dtype: int64
Or with .agg:
print(df.groupby('Ranking').agg({'Ranking': ['count']}))
Ranking
count
Ranking
['Food', ' Lounge or Study Space', ' Retail', '... 1
['Lounge or Study Space', ' Food', ' Academic R... 1
Answer 1 (score: 1)
If the format is consistent, you can use groupby on the original column (or a cleaned-up version of it) to get the counts quickly:
import pandas as pd

# groupby('rank') needs a DataFrame with a 'rank' column, not a Series
df = pd.DataFrame(
    {'rank': ['food lounge or study space retail event space ...',
              'food lounge or study space retail event space ...',
              'lounge or study space food academic resources ...',
              'lounge or study space food academic resources ...',
              'lounge or study space food academic resources ...']},
    dtype=str)
# Count how many rows share each ranking string
df.groupby('rank').size()
> rank
> food lounge or study space retail event space ... 2
> lounge or study space food academic resources ... 3
> dtype: int64
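Once the counts exist, finding the most frequent ranking is a small step. The sketch below shows it with the standard library's `collections.Counter`, assuming the rankings have already been reduced to letter combinations. The combo strings are invented and simply mirror the 2-versus-3 split in the output above.

```python
from collections import Counter

# Invented combo strings standing in for a 'Combo' column.
combos = ["ABCDEFG", "BAEDCFG", "ABCDEFG", "BAEDCFG", "BAEDCFG"]

counts = Counter(combos)
# most_common(1) returns a list holding the single (value, count) pair
# with the highest count.
top_combo, top_count = counts.most_common(1)[0]
print(top_combo, top_count)  # BAEDCFG 3
```

In pandas, calling `value_counts()` on the column should give the same tallies, sorted from most to least frequent.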