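I am trying to identify keywords in a DataFrame column and then create new binary columns when the keywords are found. The reproducible example below works when each string list contains only one keyword to identify; it also highlights the second stage, which is where the problem lies.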
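The problem is that I want to add more associated keywords to each list so that related terms are classified into the new columns. However, when I add more than one keyword to a list, I get ValueError: Length of values does not match length of index.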
import pandas as pd

# 1. Create dataframe
test = {'comment': ['my pay review was not enough',
                    'my annual bonus was too low, I need more pay',
                    'my pay is too low', 'my bonus is huge', 'better pay please'],
        'team': ['team1', 'team2', 'team3', 'team1', 'team2']}
test = pd.DataFrame(test)

# 2. Create string lists (these are the lists I want to add multiple associated keywords to)
pay_strings = ['pay']
bonus_strings = ['bonus']

# 3. Create empty lists
pay_col = []
bonus_col = []

# 4. Loop through `comment` column to identify words and represent them in the new lists with binary values
for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)
    for bonus in bonus_strings:
        if bonus in row:
            bonus_col.append(1)
        elif bonus not in row:
            bonus_col.append(0)

# 5. Add new lists to dataframe
test['pay'] = pay_col
test['bonus'] = bonus_col
test
# 6. Resulting dataframe
                                        comment   team  pay  bonus
0                  my pay review was not enough  team1    1      0
1  my annual bonus was too low, I need more pay  team2    1      1
2                             my pay is too low  team3    1      0
3                              my bonus is huge  team1    0      1
4                             better pay please  team2    1      0
Is there a way to efficiently look for multiple items from a list, or is there a better way to do this?
Answer 0 (score: 1)
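As you describe, when you add more keywords, the inner loop appends one value per keyword for every row, so the resulting pay_col list becomes longer than the number of rows in the dataframe. Assigning it as a column then raises the ValueError.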
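To make the mismatch concrete, here is a small sketch that reproduces only the list lengths, without pandas. The second keyword 'salary' is a hypothetical addition used purely for illustration; it is not from the question.

```python
comments = ['my pay review was not enough',
            'my annual bonus was too low, I need more pay',
            'my pay is too low', 'my bonus is huge', 'better pay please']

# 'salary' is a hypothetical extra keyword, added only to trigger the problem
pay_strings = ['pay', 'salary']

pay_col = []
for row in comments:
    for pay in pay_strings:
        # One value is appended per (row, keyword) pair, not one per row
        pay_col.append(1 if pay in row else 0)

print(len(comments))  # 5 rows
print(len(pay_col))   # 10 values: 5 rows x 2 keywords
```

With two keywords the list holds ten values for five rows, which is why the column assignment fails.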
Modify this block of code:
for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)
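Either maintain a separate count for each keyword (in which case every keyword in your pay_strings list gets its own column), or change the loop so that it produces a single value per row (i.e. per comment), incremented each time one of the keywords is identified.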
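The second option could look something like the sketch below. It is only one possible way to do it: the helper names flag_rows and count_rows and the extra keywords 'salary' and 'incentive' are hypothetical and chosen for illustration. Each helper returns exactly one value per comment, so the result always matches the length of the dataframe. The last lines show a vectorized alternative using Series.str.contains with the keywords joined into one regular expression.

```python
import re
import pandas as pd

test = pd.DataFrame({
    'comment': ['my pay review was not enough',
                'my annual bonus was too low, I need more pay',
                'my pay is too low', 'my bonus is huge', 'better pay please'],
    'team': ['team1', 'team2', 'team3', 'team1', 'team2'],
})

# Hypothetical extended keyword lists (the extra terms are illustrative only)
pay_strings = ['pay', 'salary']
bonus_strings = ['bonus', 'incentive']


def flag_rows(comments, keywords):
    """Return one 0/1 value per comment: 1 if any keyword is a substring."""
    return [int(any(kw in row for kw in keywords)) for row in comments]


def count_rows(comments, keywords):
    """Return one count per comment: how many of the keywords are substrings."""
    return [sum(kw in row for kw in keywords) for row in comments]


test['pay'] = flag_rows(test['comment'], pay_strings)
test['bonus'] = flag_rows(test['comment'], bonus_strings)
test['pay_hits'] = count_rows(test['comment'], pay_strings)

# Vectorized alternative: join the escaped keywords into a single regex
pay_pattern = '|'.join(re.escape(kw) for kw in pay_strings)
test['pay_vectorized'] = test['comment'].str.contains(pay_pattern).astype(int)

print(test)
```

Note that both approaches match substrings, so 'pay' would also match a word such as 'payment'. If whole-word matching matters, the regex could be wrapped in word boundaries (\b).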