Question

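I have a dataframe with roughly 130,000 tweets, each with a label next to it (1 = positive, 0 = negative). From this dataframe, I want to extract the tweets that are movie-related. To do so, I came up with a list of movie-related words: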

movie_related_words = ["movie", "movies", "watch", 
                       "watching", "film", "cinema", 
                       "actor", "video", "thriller", 
                       "horror", "dvd", "bluray", "soundtrack", 
                       "director", "remake", "blockbuster"]

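After some preprocessing, the tweets in the dataframe are tokenized, so each entry in the text column of my dataframe is a list in which every word is a separate element. For reference, here are three random elements from my dataframe: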

[well, time, for, bed, 500, am, comes, early, nice, chatting, with, everyone, have, a, good, evening, and, rest, of, the, weekend, whats, left, of, it]
[tekkah, defyingsantafe, umm, dont, forget, that, youre, all, gay, socialist, atheists]
[s, mom, nearly, got, ran, over, by, a, truck, on, her, bike, and, dropped, her, work, bag, with, all, her, information, which, was, then, stolen, fb]

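I want to filter the tweets as follows: if any word of a given tweet (that is, any element of its list) is in movie_related_words, I want to keep that observation; otherwise, I want to drop it.

I tried applying a lambda expression like this: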

def filter_movies(text):
    movie_filtered = "".join([i for i in text if i in movie_related_words])
    return movie_filtered

twitter_loaded_df['text'] = twitter_loaded_df['text'].apply(lambda x : filter_movies(x))

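But this gives me a weird result. Any guidance on how to achieve this would be greatly appreciated. A pythonic/efficient way will earn you my eternal love. I was hoping some pandas function exists for this purpose, but I haven't found it yet...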

Answer 1

If I understand correctly, try:

twitter_loaded_df['movie_related'] = twitter_loaded_df['text'].map(lambda x: max([word in movie_related_words for word in x]))

This should add a "movie_related" column containing True if any of the tweet's words is in the list, and False otherwise.
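One caveat: max() raises a ValueError when given an empty list, so the expression above would fail if preprocessing leaves any tweet with no tokens at all. A minimal sketch of the same check written with any(), which returns False for an empty input instead; the helper name is_movie_related is made up for illustration and the word list is shortened:

```python
# Shortened word list, for illustration only.
movie_related_words = ["movie", "movies", "watch", "film"]

def is_movie_related(words):
    # any() returns False for an empty iterable, so empty tweets are safe.
    return any(word in movie_related_words for word in words)

print(is_movie_related(["great", "film"]))   # True
print(is_movie_related(["bad", "weather"]))  # False
print(is_movie_related([]))                  # False

# The max() version fails on the same empty input.
try:
    max([word in movie_related_words for word in []])
except ValueError:
    print("max() raises ValueError on an empty list")
```

Assuming the dataframe from the question, this helper could be passed directly to twitter_loaded_df['text'].map(...) in place of the lambda.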

Answer 2

How about this?

MOVIE_RELATED_WORDS = set(["movie", "movies", "watch", 
                           "watching", "film", "cinema", 
                           "actor", "video", "thriller", 
                           "horror", "dvd", "bluray", "soundtrack", 
                           "director", "remake", "blockbuster"])

def contains_movie_word(words):
    return any(word in MOVIE_RELATED_WORDS for word in words)

is_movie_related = df['text'].apply(contains_movie_word)

df = df[is_movie_related]  # Filter using boolean series

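The advantages of this approach are:

It short-circuits (returns early) as soon as a movie-related word is found in a given tweet.
Each row of the dataset costs O(N_tweet_words), because a set lookup is O(1) on average.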
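The short-circuit behaviour can be observed directly. The sketch below is only an illustration: the inspected list and the is_movie_word helper are not part of the solution above, they exist purely to record which words actually get tested.

```python
MOVIE_RELATED_WORDS = {"movie", "film", "cinema"}  # shortened for illustration

inspected = []  # records every word that actually gets tested

def is_movie_word(word):
    inspected.append(word)
    return word in MOVIE_RELATED_WORDS  # average O(1) set lookup

tweet = ["great", "movie", "with", "friends", "tonight"]

# any() consumes the generator lazily and stops at the first True.
found = any(is_movie_word(word) for word in tweet)

print(found)      # True
print(inspected)  # ['great', 'movie']
```

Only the first two words are examined; the remaining three are never looked up. This depends on passing a generator expression to any(). With a list comprehension, every word would be tested before any() even starts.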

Example:

import pandas
df = pandas.DataFrame({'text': [['Hello', 'world'], ['Great', 'movie'], ['Bad', 'weather']]})

Here df is:

             text
0  [Hello, world]
1  [Great, movie]
2  [Bad, weather]

After applying the solution, is_movie_related is:

0    False
1     True
2    False
Name: text, dtype: bool
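Since the question asks for a pandas function, here is one possible alternative built only from pandas methods (explode, isin, groupby). It is a sketch, not necessarily faster than the apply version, and it assumes the dataframe index is unique and that pandas 0.25 or later is available (explode was added in that release). The label column is invented to make the example resemble the question's data.

```python
import pandas as pd

MOVIE_RELATED_WORDS = {"movie", "movies", "film"}  # shortened for illustration

df = pd.DataFrame({
    "text": [["Hello", "world"], ["Great", "movie"], ["Bad", "weather"]],
    "label": [1, 1, 0],  # hypothetical labels
})

# One row per word; the original row index is repeated for each word.
words = df["text"].explode()

# True for every individual word that is in the keyword set.
word_is_match = words.isin(MOVIE_RELATED_WORDS)

# Collapse back to one boolean per original row.
is_movie_related = word_is_match.groupby(level=0).any()

filtered = df[is_movie_related]
print(filtered)
```

Here is_movie_related holds False, True, False, so filtered keeps only the row with index 1. Unlike the any()-based function, this version looks at every word of every tweet, so it gives up the short-circuit.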
