I'm trying to remove duplicates from a list of dictionaries, but only based on duplicates in the text value.
So I want to go from this list of tweets:
{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://example.com/EcSHCAm9Nn', 'id': 634092907243393024L}
{'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://example.com/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104L}
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://example.com/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812L}
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://example.com/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633495091584323584L}
{'text': "RT : Iran Deal's Surprising Supporters: https://example.com/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://example.com/dhdylTNgoG", 'id': 633083989180448768L}
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://example.com/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://example.com/sTBhL12llF", 'id': 632525323733729280L}
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://example.com/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://example.com/sTBhL12llF", 'id': 632385798277595137L}
{'text': "RT : Iran Deal's Surprising Supporters: https://example.com/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://example.com/bJSLhd9dqA", 'id': 632370745088323584L}
{'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://example.com/foU0Sz6Jej http://example.com/WvcaNkMcu3', 'id': 631952088981868544L}
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}}
to this:
{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://example.com/EcSHCAm9Nn', 'id': 634092907243393024L}
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}}
So far all the answers I have found deal with "normal" duplicates, i.e. dictionaries whose keys and values are both identical. In my case the dictionaries only match partially: because of retweets the text key is the same, but the corresponding tweet ids are different.
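In other words, what I am after is roughly this (just a sketch, keeping only the first tweet seen for each distinct text value):

seen_texts = set()
unique_tweets = []
for tweet in tweet_text_id:
    # Duplicates are decided on the text alone; the id is ignored.
    if tweet['text'] not in seen_texts:
        seen_texts.add(tweet['text'])
        unique_tweets.append(tweet)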
Here is the whole code. Any tip on a more efficient way of writing the tweets to the CSV file (which would make removing the duplicates easier) is more than welcome.
import csv
import codecs

from TwitterSearch import TwitterSearchOrder, TwitterUserOrder, TwitterSearchException, TwitterSearch

tweet_text_id = []

try:
    tso = TwitterSearchOrder()
    tso.set_keywords(["Iran Deal"])
    tso.set_language('en')
    tso.set_include_entities(False)

    ts = TwitterSearch(
        consumer_key = "aaaaa",
        consumer_secret = "bbbbb",
        access_token = "cccc",
        access_token_secret = "dddd"
    )

    for tweet in ts.search_tweets_iterable(tso):
        tweet_text_id.append({'id': tweet['id'], 'text': tweet['text'].encode('utf8')})

    fieldnames = ['id', 'text']
    tweet_file = open('tweets.csv', 'wb')
    csvwriter = csv.DictWriter(tweet_file, delimiter=',', fieldnames=fieldnames)
    csvwriter.writerow(dict((fn, fn) for fn in fieldnames))
    for row in tweet_text_id:
        csvwriter.writerow(row)
    tweet_file.close()

except TwitterSearchException as e:
    print(e)
Answer 0 (score: 0)
I made a module that filters out duplicate entries and strips hashtags along the way:
__all__ = ['filterDuplicates']

import re

hashRegex = re.compile(r'#[a-z0-9]+', re.IGNORECASE)
trunOne = re.compile(r'^\s+')
trunTwo = re.compile(r'\s+$')

def filterDuplicates(tweets):
    dupes = []
    new_dict = []
    for dic in tweets:
        new_txt = hashRegex.sub('', dic['text'])  # Removes hashtags
        new_txt = trunOne.sub('', trunTwo.sub('', new_txt))  # Truncates extra spaces
        print(new_txt)
        dic.update({'text': new_txt})
        if new_txt in dupes:
            continue
        dupes.append(new_txt)
        new_dict.append(dic)
    return new_dict

if __name__ == '__main__':
    the_tweets = [
        {'text': '#yolo #swag something really annoying', 'id': 1},
        {'text': 'something really annoying', 'id': 2},
        {'text': 'thing thing thing haha', 'id': 3},
        {'text': '#RF thing thing thing haha', 'id': 4},
        {'text': 'thing thing thing haha', 'id': 5}
    ]

    # Tweets pre-filter
    for dic in the_tweets:
        print(dic)

    # Tweets post-filter
    for dic in filterDuplicates(the_tweets):
        print(dic)
Just import it into your script and run it to filter out the duplicate tweets!
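For example, assuming the module above is saved next to your script as filterduplicates.py (the file name is just an example), you could plug it in right before writing the CSV:

from filterduplicates import filterDuplicates  # hypothetical module name for the code above

# ... build tweet_text_id with TwitterSearch as before, then write only the unique tweets:
for row in filterDuplicates(tweet_text_id):
    csvwriter.writerow(row)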
Answer 1 (score: 0)
You could try comparing the tweets based on their "edit distance". Here is how I would compare them using fuzzywuzzy [1]:
from fuzzywuzzy import fuzz

def clean_tweet(tweet):
    """very crude. You can improve on this!"""
    tweet['text'] = tweet['text'].replace("RT :", "")
    return tweet

def is_unique(tweet, seen_tweets, threshold):
    # A tweet counts as a duplicate if it is too similar to any tweet already kept.
    for seen_tweet in seen_tweets:
        ratio = fuzz.ratio(tweet['text'], seen_tweet['text'])
        if ratio > threshold:
            return False
    return True

def dedup(tweets, threshold=50):
    deduped = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        if is_unique(cleaned, deduped, threshold):
            deduped.append(cleaned)
    return deduped

if __name__ == "__main__":
    DUP_THRESHOLD = 30
    tweets = [
        {'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024},
        {'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://t.co/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104},
        {'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812},
        {'text': 'RT : Iran deal quietly picks up some GOP backers via https://t.co/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633495091584323584},
        {'text': "RT : Iran Deal's Surprising Supporters: https://t.co/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://t.co/dhdylTNgoG", 'id': 633083989180448768},
        {'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632525323733729280},
        {'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://t.co/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://t.co/sTBhL12llF", 'id': 632385798277595137},
        {'text': "RT : Iran Deal's Surprising Supporters: https://t.co/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://t.co/bJSLhd9dqA", 'id': 632370745088323584},
        {'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://t.co/foU0Sz6Jej http://t.co/WvcaNkMcu3', 'id': 631952088981868544},
        {'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184},
    ]

    deduped = dedup(tweets, threshold=DUP_THRESHOLD)
    print(deduped)
which gives the output:
[
{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://t.co/EcSHCAm9Nn', 'id': 634092907243393024L},
{'text': ' Iran deal quietly picks up some GOP backers via https://t.co/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812L}
]
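For a feel of what the threshold does: fuzz.ratio returns an integer similarity score between 0 and 100, and any tweet scoring above the threshold against an already-kept tweet is dropped as a duplicate, so a low value like 30 collapses even loosely related tweets (which is why only two survive above). A quick check with made-up strings:

from fuzzywuzzy import fuzz

a = 'Iran deal quietly picks up some GOP backers'
b = "Iran Deal's Surprising Supporters"
print(fuzz.ratio(a, a))  # identical strings score 100
print(fuzz.ratio(a, b))  # partial overlap scores somewhere in between; raise the threshold to keep more tweets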