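I have some Twitter data, and I split the tweets into those with happy emoticons and those with sad emoticons, in a way that seemed elegant and Pythonic: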
happy_set = [":)",":-)","=)",":D",":-D","=D"]
sad_set = [":(",":-(","=("]
happy = [tweet.split() for tweet in data for face in happy_set if face in tweet]
sad = [tweet.split() for tweet in data for face in sad_set if face in tweet]
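However, a single tweet can contain emoticons from both happy_set and sad_set. What is the Pythonic way to make sure that the happy list only contains tweets with emoticons from happy_set, and vice versa?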
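Answer 0 (score: 3)
You can try using sets, specifically set.isdisjoint. Check whether the set of tokens in a happy tweet is disjoint from sad_set. If it is, the tweet definitely belongs in happy: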
happy_set = set([":)",":-)","=)",":D",":-D","=D"])
sad_set = set([":(",":-(","=("])
# happy is your existing list of potentially happy tweets (raw strings, not yet split). To remove any tweets with sad tokens...
happy = [tweet for tweet in happy if sad_set.isdisjoint(set(tweet.split()))]
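A self-contained sketch of the same idea, applied to both lists in one pass. It assumes tweets are plain strings and that emoticons appear as whitespace-separated tokens; the helper name split_by_mood and the sample data are made up for illustration and are not part of the original answer.

```python
happy_set = {":)", ":-)", "=)", ":D", ":-D", "=D"}
sad_set = {":(", ":-(", "=("}


def split_by_mood(tweets):
    """Return (happy, sad): tweets with faces from exactly one of the two sets.

    Hypothetical helper; tweets with both kinds of faces, or none, are dropped.
    """
    happy, sad = [], []
    for tweet in tweets:
        tokens = set(tweet.split())
        has_happy = not happy_set.isdisjoint(tokens)
        has_sad = not sad_set.isdisjoint(tokens)
        if has_happy and not has_sad:
            happy.append(tweet)
        elif has_sad and not has_happy:
            sad.append(tweet)
    return happy, sad


# Made-up sample data
data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]
happy, sad = split_by_mood(data)
print(happy)  # ['Happy day :-)']
print(sad)    # ['Boooh :-(']
```

Note that this matches whole tokens, whereas the question's `face in tweet` test matches substrings, so an emoticon glued to a word (for example "great:)") would be missed here.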
Answer 1 (score: 1)
I would use lambdas:
>>> is_happy = lambda tweet: any(map(lambda x: x in happy_set, tweet.split()))
>>> is_sad = lambda tweet: any(map(lambda x: x in sad_set, tweet.split()))
>>> data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]
>>> filter(lambda tweet: is_happy(tweet) and not is_sad(tweet), data)
['Happy day :-)']
>>> filter(lambda tweet: is_sad(tweet) and not is_happy(tweet), data)
['Boooh :-(']
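This avoids creating intermediate copies of data.
If data is really large, you can replace filter with ifilter from itertools to get an iterator instead of a list. (In Python 3, the built-in filter already returns an iterator and itertools.ifilter no longer exists.)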
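For Python 3, the same filtering could be sketched with named functions and generator expressions, which are lazy in the same way ifilter is. This is only an illustration under the assumption that emoticons are whitespace-separated tokens; the variable names only_happy and only_sad are my own.

```python
happy_set = {":)", ":-)", "=)", ":D", ":-D", "=D"}
sad_set = {":(", ":-(", "=("}


def is_happy(tweet):
    # True if any whitespace-separated token is a happy face
    return any(token in happy_set for token in tweet.split())


def is_sad(tweet):
    # True if any whitespace-separated token is a sad face
    return any(token in sad_set for token in tweet.split())


data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]

# Generator expressions are evaluated lazily, so no intermediate list is built
only_happy = (t for t in data if is_happy(t) and not is_sad(t))
only_sad = (t for t in data if is_sad(t) and not is_happy(t))

# Materialize only when a list is actually needed
happy_list = list(only_happy)
sad_list = list(only_sad)
print(happy_list)  # ['Happy day :-)']
print(sad_list)    # ['Boooh :-(']
```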
Answer 2 (score: 0)
Is this what you are looking for?
happy_set = set([":)",":-)","=)",":D",":-D","=D"])
sad_set = set([":(",":-(","=("])
happy_maybe_sad = [tweet.split() for tweet in data for face in happy_set if face in tweet]
sad_maybe_happy = [tweet.split() for tweet in data for face in sad_set if face in tweet]
# Keep only the items that do not also appear in the other list
happy = [item for item in happy_maybe_sad if item not in sad_maybe_happy]
sad = [item for item in sad_maybe_happy if item not in happy_maybe_sad]
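For happy... and sad..., I stuck with a list solution because the order of the items may be relevant. If it is not, it would be better to use set() for performance. In addition, sets already provide the basic set operations (union, intersection, etc.).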
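A rough sketch of the set-based variant mentioned above, next to an order-preserving one. It assumes the tweets are kept as raw strings, since token lists produced by split() are not hashable and cannot be stored in a set; the names happy_unordered, sad_unordered, happy_ordered and sad_lookup are made up for this example.

```python
happy_set = {":)", ":-)", "=)", ":D", ":-D", "=D"}
sad_set = {":(", ":-(", "=("}

# Made-up sample data
data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]

# any() adds each tweet at most once, even if it contains several faces
happy_maybe_sad = [t for t in data if any(face in t for face in happy_set)]
sad_maybe_happy = [t for t in data if any(face in t for face in sad_set)]

# Set difference: fast, but order is lost and duplicate tweets collapse
happy_unordered = set(happy_maybe_sad) - set(sad_maybe_happy)
sad_unordered = set(sad_maybe_happy) - set(happy_maybe_sad)

# Order-preserving variant: list comprehension with a set for fast lookups
sad_lookup = set(sad_maybe_happy)
happy_ordered = [t for t in happy_maybe_sad if t not in sad_lookup]

print(happy_unordered)  # {'Happy day :-)'}
print(sad_unordered)    # {'Boooh :-('}
print(happy_ordered)    # ['Happy day :-)']
```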