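My dataframe holds roughly 130,000 tweets, each with a label next to it (1 = positive, 0 = negative). From this dataframe I want to extract the tweets that are related to movies. To do so, I came up with a list of movie-related words: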
movie_related_words = ["movie", "movies", "watch",
"watching", "film", "cinema",
"actor", "video", "thriller",
"horror", "dvd", "bluray", "soundtrack",
"director", "remake", "blockbuster"]
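After some preprocessing, the tweets in the dataframe are tokenized, so that the text column of my dataframe contains one list per tweet, with each word as a separate list element. For reference, here are three random elements from my dataframe: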
[well, time, for, bed, 500, am, comes, early, nice, chatting, with, everyone, have, a, good, evening, and, rest, of, the, weekend, whats, left, of, it]
[tekkah, defyingsantafe, umm, dont, forget, that, youre, all, gay, socialist, atheists]
[s, mom, nearly, got, ran, over, by, a, truck, on, her, bike, and, dropped, her, work, bag, with, all, her, information, which, was, then, stolen, fb]
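I want to filter the tweets as follows: when any word of a given tweet (that is, any element of its list) is in movie_related_words, I want to keep that observation; otherwise, I want to drop it.
I tried applying a lambda expression as follows: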
def filter_movies(text):
    movie_filtered = "".join([i for i in text if i in movie_related_words])
    return movie_filtered
twitter_loaded_df['text'] = twitter_loaded_df['text'].apply(lambda x : filter_movies(x))
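But this gives me a strange result. Any guidance on how to achieve this would be much appreciated. A pythonic / efficient way would earn you my eternal love. I was hoping some pandas function exists for this purpose, but I have not found one yet...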
Answer 0 (score: 1)
If I understand you correctly, try:
# any() returns False for an empty token list, where max() would raise a ValueError
twitter_loaded_df['movie_related'] = twitter_loaded_df['text'].map(lambda x: any(word in movie_related_words for word in x))
This should add a "movie_related" column containing True / False, depending on whether any of the tweet's words is in the list.
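As a rough illustration of how such a column relates to the question, the sketch below first shows what the original filter_movies produces (the matched words glued into one string, or an empty string when nothing matches), then builds the boolean flag and keeps only the flagged rows. The three-row frame and its label values are made up for the demo, and the names glued and movie_tweets are hypothetical. The flag is computed with any here, which also returns False for an empty token list.

```python
import pandas as pd

movie_related_words = ["movie", "movies", "watch", "watching", "film", "cinema",
                       "actor", "video", "thriller", "horror", "dvd", "bluray",
                       "soundtrack", "director", "remake", "blockbuster"]

# Hypothetical stand-in for twitter_loaded_df: tokenized tweets plus a label.
twitter_loaded_df = pd.DataFrame({
    "text": [["great", "movie", "tonight"],
             ["time", "for", "bed"],
             ["watching", "a", "horror", "film"]],
    "label": [1, 0, 1],
})

# The approach from the question: matched words are joined without a separator.
def filter_movies(text):
    return "".join([i for i in text if i in movie_related_words])

# One string per tweet; tweets without a match become an empty string.
glued = twitter_loaded_df["text"].apply(filter_movies)

# Boolean flag per tweet: True if at least one token is a movie-related word.
twitter_loaded_df["movie_related"] = twitter_loaded_df["text"].map(
    lambda words: any(word in movie_related_words for word in words)
)

# Boolean indexing keeps only the rows whose flag is True.
movie_tweets = twitter_loaded_df[twitter_loaded_df["movie_related"]]
```

Assigning the result of filter_movies back to the text column, as in the question, overwrites the token lists with these strings and never removes a row, which is probably why the result looked strange.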
Answer 1 (score: 1)
How about this?
MOVIE_RELATED_WORDS = set(["movie", "movies", "watch",
"watching", "film", "cinema",
"actor", "video", "thriller",
"horror", "dvd", "bluray", "soundtrack",
"director", "remake", "blockbuster"])
def contains_movie_word(words):
    return any(word in MOVIE_RELATED_WORDS for word in words)
is_movie_related = df['text'].apply(contains_movie_word)
df = df[is_movie_related] # Filter using boolean series
The advantage of this approach is that checking a tweet takes O(N_tweet_words) time, because a set lookup is O(1) on average. Example:
import pandas
df = pandas.DataFrame({'text': [['Hello', 'world'], ['Great', 'movie'], ['Bad', 'weather']]})
Here df is:
text
0 [Hello, world]
1 [Great, movie]
2 [Bad, weather]
After applying the solution, is_movie_related is:
0 False
1 True
2 False
Name: text, dtype: bool
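Since the question asks for a pandas function, here is one possible alternative that avoids a Python-level function per row. It is only a sketch and assumes pandas 0.25 or newer, where Series.explode is available. The variable names tokens and filtered are made up for the illustration, and the sample frame is the same one as in the example above.

```python
import pandas as pd

MOVIE_RELATED_WORDS = {"movie", "movies", "watch", "watching", "film", "cinema",
                       "actor", "video", "thriller", "horror", "dvd", "bluray",
                       "soundtrack", "director", "remake", "blockbuster"}

df = pd.DataFrame({"text": [["Hello", "world"], ["Great", "movie"], ["Bad", "weather"]]})

# One row per token; each token keeps the index of the tweet it came from.
tokens = df["text"].explode()

# Vectorized membership test per token, then collapse back to one flag per tweet.
is_movie_related = tokens.isin(MOVIE_RELATED_WORDS).groupby(level=0).any()

# Boolean indexing aligns on the index and keeps only matching tweets.
filtered = df[is_movie_related]
```

Whether this is faster than apply with a set depends on the data, so it may be worth timing both on the real dataframe. Note that the match is case-sensitive in both variants: "movie" matches, "Movie" would not.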