Question

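I am trying to identify keywords in a DataFrame column and then create new binary columns whenever a keyword is found. The reproducible example below works when each string list contains only a single keyword to identify; it also highlights the second stage, which is where the problem lies.

The problem is that I want to add more associated keywords to each list, so that related terms can be efficiently classified into the new columns. However, when I add more than one keyword to a list, I get `ValueError: Length of values does not match length of index`.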

import pandas as pd

# 1. Create dataframe
test = {'comment': ['my pay review was not enough',
                    'my annual bonus was too low, I need more pay',
                    'my pay is too low', 'my bonus is huge', 'better pay please'],
        'team': ['team1', 'team2', 'team3', 'team1', 'team2']}

test = pd.DataFrame(test)

# 2. Create string lists (these are the lists I want to add multiple associated keywords to)
pay_strings = ['pay']
bonus_strings = ['bonus']

# 3. Create empty lists
pay_col = []
bonus_col = []

# 4. Loop through `comment` column to identify words and represent them in the new lists with binary values

for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)

    for bonus in bonus_strings:
        if bonus in row:
            bonus_col.append(1)
        elif bonus not in row: 
            bonus_col.append(0)          

# 5. Add new lists to dataframe

test['pay'] = pay_col
test['bonus'] = bonus_col
test

# 6. Resulting dataframe
    comment                                       team    pay   bonus
0   my pay review was not enough                  team1   1     0
1   my annual bonus was too low, I need more pay  team2   1     1
2   my pay is too low                             team3   1     0
3   my bonus is huge                              team1   0     1
4   better pay please                             team2   1     0

Is there a way to efficiently look up multiple items in a list, or is there a better way to do this?

Answer 1

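As mentioned, when you add additional keywords, the resulting pay_col list becomes longer than the number of rows in the DataFrame, which causes the error you quoted. The inner loop appends one value per keyword for every row, so the list ends up with (rows x keywords) entries instead of one entry per row.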
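To make the length mismatch concrete, here is a minimal sketch built on the question's data. The second keyword, 'salary', is a hypothetical addition used only to trigger the problem; it does not come from the original post.

```python
import pandas as pd

test = pd.DataFrame({
    'comment': ['my pay review was not enough',
                'my annual bonus was too low, I need more pay',
                'my pay is too low', 'my bonus is huge', 'better pay please'],
    'team': ['team1', 'team2', 'team3', 'team1', 'team2'],
})

# 'salary' is a hypothetical second keyword, added only for illustration
pay_strings = ['pay', 'salary']

pay_col = []
for row in test['comment']:
    for pay in pay_strings:
        # One value is appended per (row, keyword) pair, not one per row
        pay_col.append(1 if pay in row else 0)

print(len(test))     # number of rows in the DataFrame
print(len(pay_col))  # number of values collected by the loop

try:
    test['pay'] = pay_col
except ValueError as err:
    # The list is longer than the index, so the assignment is rejected
    print(err)
```

With two keywords and five comments, the loop collects ten values, which cannot be assigned to a five-row column.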

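Modify this code block:

for row in test['comment']:
    for pay in pay_strings:
        if pay in row:
            pay_col.append(1)
        elif pay not in row:
            pay_col.append(0)

so that it either maintains a unique count for each keyword (in which case you would have one column for every keyword in your pay_strings list), or increments a single count for each row (i.e. each comment) whenever a keyword is identified.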
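Both options can be sketched roughly as follows. This is one possible reading of the suggestion, not the answerer's own code; the extra keywords 'salary' and 'incentive' and the names `per_keyword`, `per_category`, and `pay_count` are hypothetical and used only for illustration.

```python
import pandas as pd

test = pd.DataFrame({
    'comment': ['my pay review was not enough',
                'my annual bonus was too low, I need more pay',
                'my pay is too low', 'my bonus is huge', 'better pay please'],
    'team': ['team1', 'team2', 'team3', 'team1', 'team2'],
})

# Hypothetical keyword lists: 'salary' and 'incentive' are illustrative additions
pay_strings = ['pay', 'salary']
bonus_strings = ['bonus', 'incentive']

# Option 1: one binary column per keyword
per_keyword = test.copy()
for keyword in pay_strings + bonus_strings:
    per_keyword[keyword] = [1 if keyword in row else 0
                            for row in test['comment']]

# Option 2: one value per comment for each keyword list
per_category = test.copy()
# Number of keywords from the list that occur in each comment
per_category['pay_count'] = [sum(kw in row for kw in pay_strings)
                             for row in test['comment']]
# Binary flag: 1 if any keyword from the list occurs in the comment
per_category['pay'] = [int(any(kw in row for kw in pay_strings))
                       for row in test['comment']]
per_category['bonus'] = [int(any(kw in row for kw in bonus_strings))
                         for row in test['comment']]

print(per_keyword)
print(per_category)
```

Either way, exactly one value is produced per row, so the new columns always match the length of the index. Note that `in` performs substring matching here, as in the original loop, so 'pay' would also match a word such as 'repay'.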

How to identify multiple list items in a for loop