Filter data based on a list of keywords

Date: 2019-11-24 17:43:39

Tags: python pandas dataframe text

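I have a dataframe with roughly 130,000 tweets, each with a label next to it (1 = positive, 0 = negative). From this dataframe, I want to extract the tweets that are related to movies. To do that, I came up with a list of movie-related words: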

movie_related_words = ["movie", "movies", "watch", 
                       "watching", "film", "cinema", 
                       "actor", "video", "thriller", 
                       "horror", "dvd", "bluray", "soundtrack", 
                       "director", "remake", "blockbuster"]

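After some preprocessing, the tweets in the dataframe are tokenized, so the text column of my dataframe contains one list per tweet, where each word is a separate list element. For reference, here are three random elements from my dataframe: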

[well, time, for, bed, 500, am, comes, early, nice, chatting, with, everyone, have, a, good, evening, and, rest, of, the, weekend, whats, left, of, it]
[tekkah, defyingsantafe, umm, dont, forget, that, youre, all, gay, socialist, atheists]
[s, mom, nearly, got, ran, over, by, a, truck, on, her, bike, and, dropped, her, work, bag, with, all, her, information, which, was, then, stolen, fb]

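I want to filter on these lists: if any word of a given tweet (that is, any element of its list) is in movie_related_words, I want to keep that observation; otherwise, I want to drop it.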

I tried applying a lambda expression as follows:

def filter_movies(text):
    movie_filtered = "".join([i for i in text if i in movie_related_words])
    return movie_filtered

twitter_loaded_df['text'] = twitter_loaded_df['text'].apply(lambda x : filter_movies(x))

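But this gives me a strange result. Any guidance on how to achieve this would be much appreciated. A pythonic/efficient way would earn you my eternal love. I was hoping some pandas function exists for this purpose, but I haven't found it yet...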

2 answers:

Answer 0 (score: 1)

If I understand correctly, try:

twitter_loaded_df['movie_related'] = twitter_loaded_df['text'].map(lambda x: max([word in movie_related_words for word in x]))

This should add a "movie_related" column that is True if any word of the tweet is in the list and False otherwise.
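One caveat with `max` here: it raises `ValueError` on an empty list, which would happen for any tweet left with no tokens after preprocessing. The following is a minimal sketch of the difference, assuming such empty rows can occur. The shortened word list, the sample tweets, and the helper names `flag_with_max` and `flag_with_any` are made up for illustration.

```python
movie_related_words = ["movie", "film", "cinema"]  # shortened list, illustration only

# Hypothetical tokenized tweets; the last one is empty
tweets = [["great", "movie"], ["bad", "weather"], []]

def flag_with_max(words):
    # Mirrors the expression above; max() of an empty list raises ValueError
    return max([word in movie_related_words for word in words])

def flag_with_any(words):
    # any() returns False for an empty iterable instead of raising
    return any(word in movie_related_words for word in words)

flags = [flag_with_any(t) for t in tweets]
print(flags)

try:
    flag_with_max([])
except ValueError as exc:
    print("max() failed on an empty tweet:", exc)
```

If empty token lists are possible in the data, `any` is probably the safer choice for the flag column.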

Answer 1 (score: 1)

How about this?

MOVIE_RELATED_WORDS = set(["movie", "movies", "watch", 
                           "watching", "film", "cinema", 
                           "actor", "video", "thriller", 
                           "horror", "dvd", "bluray", "soundtrack", 
                           "director", "remake", "blockbuster"])

def contains_movie_word(words):
    return any(word in MOVIE_RELATED_WORDS for word in words)

is_movie_related = df['text'].apply(contains_movie_word)

df = df[is_movie_related]  # Filter using boolean series

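The advantages of this approach are:

  1. It short-circuits (returns early) as soon as a movie-related word is found in a given tweet.
  2. Each row of the dataset costs O(N_tweet_words), because set lookups are O(1) on average.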
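The short-circuiting in point 1 can be observed directly. This is a small sketch, not part of the solution itself: the `checked` list and the `is_movie_word` helper are hypothetical names added only to record which words get looked up.

```python
MOVIE_RELATED_WORDS = {"movie", "film", "cinema"}  # shortened set, illustration only

checked = []  # records every word that was actually tested

def is_movie_word(word):
    checked.append(word)
    return word in MOVIE_RELATED_WORDS  # average O(1) set membership test

tweet = ["great", "movie", "last", "night"]  # hypothetical tokenized tweet

# The generator expression is consumed lazily, so any() stops at the first True
result = any(is_movie_word(word) for word in tweet)

print(result)   # True
print(checked)  # ['great', 'movie']
```

The words after "movie" are never examined. Passing a list comprehension to `any` instead of a generator expression would evaluate every word before `any` runs, which loses this benefit.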

Example:

import pandas
df = pandas.DataFrame({'text': [['Hello', 'world'], ['Great', 'movie'], ['Bad', 'weather']]})

Here df is:

             text
0  [Hello, world]
1  [Great, movie]
2  [Bad, weather]

After applying the solution, is_movie_related is:

0    False
1     True
2    False
Name: text, dtype: bool
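To round off the example, here is a sketch of the final filtering step on the same toy data. The `label` column and its values are an assumption added to mimic the sentiment labels described in the question, and the word set is shortened for brevity.

```python
import pandas as pd

MOVIE_RELATED_WORDS = {"movie", "film", "cinema"}  # shortened set, illustration only

def contains_movie_word(words):
    # True as soon as one token is a movie-related word
    return any(word in MOVIE_RELATED_WORDS for word in words)

df = pd.DataFrame({
    "text": [["Hello", "world"], ["Great", "movie"], ["Bad", "weather"]],
    "label": [1, 1, 0],  # hypothetical sentiment labels
})

is_movie_related = df["text"].apply(contains_movie_word)

# Boolean indexing keeps only the rows flagged True; the other columns come along
filtered = df[is_movie_related].reset_index(drop=True)
print(filtered)
```

Only the "Great movie" row should remain, together with its label. Note that the membership test is an exact, case-sensitive match, so a token such as "Movie" would not be caught unless the tokens are lowercased during preprocessing.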