I have a dataframe of user comments about movies, and I want to pick out the cases where a user describes a movie as "movie1" meets "movie2".
User id Old id_New id Score Comments
947952018 3101_771355141 3.0 If you want to see a comedy and have a stupid ...
805407067 11903_18330 5.0 Argento?s fever dream masterpiece. Fairy tale ...
901306244 16077_771225176 4.5 Evil Dead II meets Brothers Grimm and Hawkeye ...
901306244 NaN_381422014 1.0 Biggest disappointment! There's a host of ...
15169683 NaN_22471 3.0 You know in the original story of Pinocchio he...
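I wrote a function that takes a comment, finds the word "meets", takes the n words before and after it, and returns (hopefully) the gist of the titles of movie1 and movie2. I plan to fuzzy-match these against the titles in another dataframe later.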
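def parse_movie(comment, num_words):
    # Split the comment into the text before and after the first 'meets'.
    words = comment.partition('meets')
    # rsplit splits from the right, so the last num_words words stay separate.
    words_before = words[0].rsplit(maxsplit=num_words)[-num_words:]
    words_after = words[2].split(maxsplit=num_words)[:num_words]
    movie1 = ' '.join(words_before)
    movie2 = ' '.join(words_after)
    return movie1, movie2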
How can I apply this function to the Comments column of the original pandas dataframe and put the returned movie1 and movie2 titles in separate columns? I tried
df['Comments'].apply(parse_movie)
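but then I can't specify the num_words I want to use. Operating directly on the column doesn't work for me either, and I don't know how to put the new movie titles into new columns.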
parse_movie(sample['Comments'], 4)
AttributeError: 'Series' object has no attribute 'partition'
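Any suggestions would be appreciated!
Answer 0: (score: 1)
Based on the answer to "how to split column of tuples in pandas dataframe?". This can be done with a lambda function and apply(pd.Series): the lambda passes num_words to parse_movie for each comment, and apply(pd.Series) expands the returned tuples into two columns. The result is saved to the dataframe columns 'movie1' and 'movie2'.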
num_words = 4
df[['movie1','movie2']] = df['Comments'].apply(lambda comment: parse_movie(comment, num_words)).apply(pd.Series)
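For reference, here is a minimal end-to-end sketch of a variant of the same idea. It assumes a recent pandas version and uses made-up sample comments rather than the real data. The parse_movie below is a simplified restatement of the function from the question. Instead of a lambda, it passes num_words through the args parameter of Series.apply, and it builds the two columns from the list of tuples rather than using apply(pd.Series), which may be faster on large frames.

```python
import pandas as pd


def parse_movie(comment, num_words):
    """Return up to num_words words on each side of the first 'meets'."""
    before, _, after = comment.partition('meets')
    movie1 = ' '.join(before.split()[-num_words:])
    movie2 = ' '.join(after.split()[:num_words])
    return movie1, movie2


# Made-up sample data, shaped like the Comments column in the question.
df = pd.DataFrame({
    'Comments': [
        'Evil Dead II meets Brothers Grimm and Hawkeye in the woods',
        'Biggest disappointment of the year',
    ]
})

num_words = 2

# Extra positional arguments are forwarded to the function via args=,
# so each call becomes parse_movie(comment, num_words).
pairs = df['Comments'].apply(parse_movie, args=(num_words,))

# Turn the Series of tuples into a two-column frame; reusing df.index
# keeps the new columns aligned with the original rows.
titles = pd.DataFrame(pairs.tolist(), index=df.index,
                      columns=['movie1', 'movie2'])
df = df.join(titles)
```

Note that a comment without "meets" still produces a movie1 value (its trailing words) and an empty movie2, so it may be worth filtering first, for example with df['Comments'].str.contains('meets').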