Question

I have a dataframe in which one column contains string descriptions of films, written in Danish:

df.Description.tail()

24756    Der er nye kendisser i rundkredsen, nemlig Ski...
24757    Hvad får man, hvis man blander en gruppe af k...
24758    Hvordan vælter man en minister? Hvordan ødel...
24759    Der er dømt mandehygge i hulen hos ZULUs tera...
24760    Kender du de dage på arbejdet, hvor alt bare ...

I first checked whether all values in the Description column are strings: df.applymap(type).eq(str).all()

Video.ID.v26    False
Title            True
Category        False
Description      True
dtype: bool

I want to create another column containing the words found in each of these strings, separated by commas, like this:

24756   [Der, er, nye, kendisser, i, rundkredsen, ...

Inside the loop I also use Rake() to remove Danish stop words. Here is my loop:

# initializing the new column
df['Key_words'] = ""

for index, row in df.iterrows():
    plot = row['Description']

    # instantiating Rake; by default it uses English stopwords from NLTK, but we want Danish,
    # and discards all punctuation characters
    r = Rake('da')

    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)

    # getting the dictionary with key words and their scores
    key_words_dict_scores = r.get_word_degrees()

    # assigning the key words to the new column
    row['Key_words'] = list(key_words_dict_scores.keys())

The problem is that the new column Key_words is empty...

df.Key_words.tail()

24756    
24757    
24758    
24759    
24760    
Name: Key_words, dtype: object

Any help is appreciated.

Answer 1

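From the documentation of df.iterrows:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.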

In your case, this combination is the problem:

for index, row in df.iterrows():  # row is generated
    [...]
    row['Key_words'] = list(key_words_dict_scores.keys()) # row is modified
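The effect can be reproduced with a small sketch. The frame below is a hypothetical stand-in for the one in the question (the column name Video_ID and the sample rows are assumptions), and a plain split() replaces the Rake calls. The integer column gives the frame mixed dtypes, which is the situation where iterrows() hands back a copy of each row:

```python
import pandas as pd

# Hypothetical toy frame, not the asker's data. The integer column makes the
# dtypes mixed, so iterrows() builds each row as a fresh object Series (a copy).
df = pd.DataFrame({
    "Video_ID": [1, 2],
    "Description": ["Der er nye kendisser", "Hvad får man"],
})
df["Key_words"] = ""

for index, row in df.iterrows():
    # This writes to the temporary row Series, not to df.
    row["Key_words"] = row["Description"].split()

# The column in df still holds the empty strings it was initialised with.
print(df["Key_words"].tolist())
```

The assignment raises no error, which is why the empty column is easy to miss.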

If you want to keep iterating, you can avoid the problem by collecting the intermediate results in a list, for example:

import pandas as pd

# make dummy dataframe
df = pd.DataFrame({'a':range(5)})

# initialise list
new_entries = []

# do iterrows, and operations on entries in row
for ix, row in df.iterrows():
    new_entries.append(2 * row['a'])  # store intermediate data in list

df['b'] = new_entries # assign temp data to new column
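Applied to the Description column, the same pattern might look like the sketch below. The helper extract_keywords and the tiny DANISH_STOPWORDS set are hypothetical stand-ins for the Rake calls in the question, and the sample rows are made up, so treat this as an illustration of the list-collecting pattern rather than of keyword extraction itself:

```python
import pandas as pd

# Hypothetical stop-word set and helper, standing in for the Rake calls.
DANISH_STOPWORDS = {"der", "er", "i", "en", "af", "man"}

def extract_keywords(text):
    """Lower-case the words, strip simple punctuation, drop stop words."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w and w not in DANISH_STOPWORDS]

# Made-up sample rows, not the asker's data.
df = pd.DataFrame({
    "Video_ID": [1, 2],
    "Description": [
        "Der er nye kendisser i rundkredsen.",
        "Hvad får man, hvis man blander?",
    ],
})

# Collect one keyword list per row; the rows themselves are only read.
key_words = []
for index, row in df.iterrows():
    key_words.append(extract_keywords(row["Description"]))

# Assign the collected lists to the new column in one step.
df["Key_words"] = key_words
```

In the real loop, the body of extract_keywords would be replaced by the r.extract_keywords_from_text(plot) and r.get_word_degrees() calls from the question.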

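One more piece of advice: I had to generate my own dataframe to illustrate my solution, because the format of the data you posted does not allow easy import or copying. Please have a look at this post on how to write better-formulated questions.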

Answer 2

Use apply:

def my_keyword_func(row):
    plot = row['Description']
    # ... (keyword extraction goes here)
    return ['key word 1', 'key word 2']

df['Key_words'] = df.apply(my_keyword_func, axis=1)
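A runnable version of this outline could look as follows. The body of my_keyword_func is an assumption (a simple split with a hypothetical stop-word set instead of Rake), as are the sample rows:

```python
import pandas as pd

# Hypothetical stop-word set; the question uses Rake for this step.
STOPWORDS = {"der", "er", "i", "en", "af", "man"}

def my_keyword_func(row):
    """Return the non-stop-words of the row's Description as a list."""
    plot = row["Description"]
    words = (w.strip(".,?!").lower() for w in plot.split())
    return [w for w in words if w and w not in STOPWORDS]

# Made-up sample rows, not the asker's data.
df = pd.DataFrame({
    "Video_ID": [1, 2],
    "Description": [
        "Hvordan vælter man en minister?",
        "Der er dømt mandehygge i hulen.",
    ],
})

# axis=1 passes each row to the function; the returned lists become the column.
df["Key_words"] = df.apply(my_keyword_func, axis=1)
```

Since only one column is read, df["Description"].apply(...) with a function that takes the text directly would be an alternative that avoids building a full row for each call.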
