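Removing strings that start with certain prefixes from a list

Date: 2019-03-21 06:18:23

Tags: python string data-cleaning

I have a list of strings related to Twitter hashtags. I want to remove entire strings that begin with certain prefixes.

For example: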

testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]

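I want to remove the picture URLs, the hashtags, and the @ mentions.

So far I have tried a few things, namely the startswith() method and the replace() method.

For example: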

prefixes = ['pic.twitter.com', '#', '@']
bestlist = []

for line in testlist:
    for word in prefixes:
        line = line.replace(word,"")
    bestlist.append(line)

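This seems to get rid of "pic.twitter.com", but not the run of letters and digits at the end of the URL. Those strings are dynamic and the URL ending is different every time, which is why I want to get rid of the whole string if it starts with that prefix.

I also tried tokenizing everything, but replace() still doesn't remove the whole word: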

import nltk 

for line in testlist:
    tokens = nltk.tokenize.word_tokenize(line)
    for token in tokens:
        for word in prefixes:
            if token.startswith(word):
                token = token.replace(word,"")
                print(token)

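I'm starting to lose hope in the startswith() and replace() methods, and I suspect these two methods may be the wrong tools for this.

Is there a better way to go about this? How can I get the desired result of removing all strings that start with #, @, and pic.twitter?

4 answers:

Answer 0 (score: 3)

You can use a regular expression to specify the kinds of words to replace, and use re.sub: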

import re

testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]
regexp = r'pic\.twitter\.com\S+|@\S+|#\S+'

res = [re.sub(regexp, '', sent) for sent in testlist]
# Print one cleaned string per line
for sent in res:
    print(sent)

Output

Just caught up with  Just so cute! Loved it. 
After work drinks with this one  no dancing tonight though    
Only just catching up and  you are gorgeous 
Loved working on this. Always a pleasure getting to assist the wonderful  on  wonderful new show !!  
Just watching  & 
 what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. 
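The output above still has doubled spaces where tokens were removed. Here is a minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace (including the embedded newline) into a single space. The names TOKEN_RE and strip_twitter_tokens are hypothetical and not part of the answer:

```python
import re

# Same pattern as in the answer: a pic.twitter.com URL, a mention, or a
# hashtag, each followed by a run of non-whitespace characters.
TOKEN_RE = re.compile(r'pic\.twitter\.com\S+|@\S+|#\S+')

def strip_twitter_tokens(text):
    """Remove matching tokens, then collapse runs of whitespace."""
    without_tokens = TOKEN_RE.sub('', text)
    # split() with no arguments splits on any whitespace and drops empty pieces
    return ' '.join(without_tokens.split())

sample = 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing'
print(strip_twitter_tokens(sample))
```

Because \S+ consumes everything up to the next whitespace, punctuation attached to a hashtag (such as the period in "#FlirtyDancing.") is removed along with it.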

Answer 1 (score: 2)

This solution uses neither regular expressions nor any additional imports.

prefixes = ['pic.twitter.com', '#', '@']
testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]


def iter_tokens(line):
    for word in line.split():
        if not any(word.startswith(prefix) for prefix in prefixes):
            yield word

for line in testlist:
    row = list(iter_tokens(line))
    print(' '.join(row))

This produces the following result:

python test.py 
Just caught up with Just so cute! Loved it.
After work drinks with this one no dancing tonight though
Only just catching up and you are gorgeous
Loved working on this. Always a pleasure getting to assist the wonderful on wonderful new show !!
Just watching & what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up..
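As a possible simplification of the same idea, str.startswith also accepts a tuple of prefixes, so the any(...) check can be folded into a single call. This is a sketch only, and the names PREFIXES and drop_prefixed_words are made up for illustration:

```python
PREFIXES = ('pic.twitter.com', '#', '@')  # startswith needs a tuple, not a list

def drop_prefixed_words(line, prefixes=PREFIXES):
    """Drop every whitespace-separated word that starts with one of the prefixes."""
    return ' '.join(word for word in line.split() if not word.startswith(prefixes))

print(drop_prefixed_words('new show !! #flirtydancing pic.twitter.com/URMjUcgmyi'))
```

As with the generator version, splitting and re-joining normalizes all whitespace to single spaces.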

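Answer 2 (score: 1)

You need to match with a regular expression rather than a static string. replace does not understand regular expressions; use re.sub instead. To remove the URLs you describe from a single string s, you would need something like this: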

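import re
# Raw string; the character class lists the characters allowed after the domain,
# so the match runs to the end of the URL.
re.sub(r'pic\.twitter\.com[a-zA-Z0-9,.\-!/()=?`*;:_{}\[\]\|~%-]*', '', s)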

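To match hashtags, replies, and URLs, you can either run consecutive sub operations or combine all the regular expressions into a single expression. If you have many patterns, the former is preferable and should be combined with re.compile.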
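The consecutive-sub option could look roughly like the sketch below. The individual patterns and the names PATTERNS and remove_all are illustrative assumptions rather than part of this answer. Each pattern is compiled once and then applied in turn:

```python
import re

# Hypothetical pattern list; extend it as needed.
PATTERNS = [
    re.compile(r'pic\.twitter\.com\S*'),  # picture URLs
    re.compile(r'#\w+'),                  # hashtags
    re.compile(r'@\w+'),                  # mentions / replies
]

def remove_all(text, patterns=PATTERNS):
    """Apply each compiled pattern in sequence, deleting every match."""
    for pattern in patterns:
        text = pattern.sub('', text)
    return text

cleaned = remove_all('hi @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8')
print(cleaned.strip())
```

The surrounding spaces are left in place, so a final strip() or whitespace-collapsing step may still be wanted.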

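Note that this only matches URLs with the domain twitter.com and the subdomain pic. To match arbitrary URLs, you would have to extend the regular expression with an appropriate pattern. You may want to look at this post.

Edit: generalized the regular expression according to RFC 3986, following the comment by I.Am.A.Guy.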

Answer 3 (score: 1)

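prefixes = {'pic.twitter.com', '#', '@'} # use sets for faster lookups

def clean_tweet(tweet):
    # Keep a word only if neither its first 15 characters (the length of
    # 'pic.twitter.com') nor its first character is one of the prefixes.
    return " ".join(word for word in tweet.split() if word[:15] not in prefixes and word[0] not in prefixes)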

Or see:

https://www.nltk.org/api/nltk.tokenize.html

TweetTokenizer may solve many of your problems.