Removing common junk from tweets for topic modeling

Time: 2015-10-29 20:21:43

Tags: python string list url twitter

I am trying to remove common junk from tweets, such as RT, every string that starts with @, and every URL. This is how I am approaching it:

prefixes = ["http", "ftp", "@", "#", "RT"]

for prefix in prefixes:
    for word in final_tweet:
        if word.startswith(prefix):
            print "starts with prefix"
            word = ''

Although this code sometimes removes the junk (and always detects it), it does not always remove it. So I would like to know what the problem is.

Here are some examples of the output:

['RT', '@NadelParis:', 'Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https:/\xe2\x80\xa6']
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
['Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https://t.co/I40s8x3QAV']

['RT', '@dbrandSkins:', 'Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
starts with prefix
starts with prefix
['Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
['RT', '@WeLoveRobDyrdek:', 'This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
starts with prefix
starts with prefix
['This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
['RT', '@sayingsforgirls:', 'Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
starts with prefix
starts with prefix
['Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
['RT', '@BrianaaSymonee:', 'says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']
starts with prefix
starts with prefix
['says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']

2 answers:

Answer 0: (score: 1)

You can check each prefix in turn, rebuilding the list without the words that start with it:

>>> for prefix in prefixes:
...     final_tweet = [ w for w in final_tweet if not w.startswith(prefix)]
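
For context on why the loop in the question leaves the list untouched: `word = ''` only rebinds the loop variable to a new string and never writes anything back into `final_tweet`. The following is a minimal sketch of that difference, not a definitive implementation. The sample tokens are taken from the question's output, and the function names `strip_in_place_attempt` and `strip_by_rebuilding` are illustrative, not from the original post.

```python
PREFIXES = ["http", "ftp", "@", "#", "RT"]


def strip_in_place_attempt(tokens):
    """Mimics the question's loop: rebinding `word` does not modify `tokens`."""
    for prefix in PREFIXES:
        for word in tokens:
            if word.startswith(prefix):
                word = ''  # rebinds the local name only; the list is unchanged
    return tokens


def strip_by_rebuilding(tokens):
    """Builds a new filtered list for each prefix, as in the answer above."""
    for prefix in PREFIXES:
        tokens = [w for w in tokens if not w.startswith(prefix)]
    return tokens


sample = ['RT', '@dbrandSkins:', 'Dear', 'Apple,', 'https://t.co/5N86jYipOI']
unchanged = strip_in_place_attempt(list(sample))  # same contents as `sample`
filtered = strip_by_rebuilding(list(sample))      # junk tokens dropped
```

The first function returns the tokens exactly as they went in, while the second keeps only the words that match none of the prefixes.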

Answer 1: (score: 0)

An answer given by someone on the #Python IRC channel:

final_tweet = [word for word in final_tweet if not any(word.startswith(prefix) for prefix in prefixes)]
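
A possible variation on the line above, offered as a sketch under stated assumptions: `str.startswith` also accepts a tuple of prefixes, so the inner `any(...)` is not strictly needed. Note too that treating "RT" as a prefix removes tokens such as 'RTPlz<3' as well. The version below assumes that only the standalone token "RT" should be dropped; whether that is what you want depends on your data. The names `EXACT_JUNK` and `clean_tweet` are hypothetical.

```python
PREFIXES = ("http", "ftp", "@", "#")  # str.startswith accepts a tuple of prefixes
EXACT_JUNK = {"RT"}                   # assumption: drop "RT" only as a whole token


def clean_tweet(tokens):
    """Return the tokens that are neither exact junk nor start with a junk prefix."""
    return [t for t in tokens
            if t not in EXACT_JUNK and not t.startswith(PREFIXES)]


tokens = ['RT', '@NadelParis:', 'NOW!', 'https://t.co/xilNh66e34', 'RTPlz<3']
cleaned = clean_tweet(tokens)
```

Here 'RTPlz<3' survives because it is not exactly "RT"; move "RT" back into `PREFIXES` if such tokens should be removed as well.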