Removing duplicate CSV entries using Python

Date: 2012-06-05 21:31:33

Tags: python csv

I just finished a script that (sigh) finally works. It searches Twitter for keywords. The results are written to a CSV with four columns: Keyword, Tweet, Lat, Lon (location). The code I am using is:

import tweepy
import csv

keywordList = ['McDonalds', 'Taco Bell', 'Burger King',]
for keyword in keywordList:
    result = tweepy.api.search(q=keyword,rpp=1000,page=2, geocode= "34.085422,-117.900879,500mi" )

    with open(r'C:\Temp\results.csv', 'a') as acsv:
        w = csv.writer(acsv)
        for tweet in result:
            try:
                a = tweet.geo['coordinates']
                print a[0] , a[1]
                print tweet.text
                w.writerow((keyword, tweet.text, a[0] , a[1]))
            except:
                pass

I want to run this search every 5 minutes, using Task Manager or Python, but it keeps rewriting duplicates. I planned to remove the duplicates with the code below, but two things happen: results2.csv comes out blank, and when I open the CSV it is locked, so I have to view it read-only. I tried f1.close(), writer.close(), etc., but it says 'csv.reader' object has no attribute 'close'.

My main concern is getting rid of the duplicates, either by writing to a new CSV or by somehow deleting and rewriting the same table on each search. Any suggestions are much appreciated!!

import csv

f1 = csv.reader(open(r'C:\Temp\results.csv', 'rb'))
writer = csv.writer(open(r'C:\Temp\results2.csv', 'wb'))
tweet = set()
for row in f1:
    if row[1] not in tweet:
        writer.writerow(row)
        tweet.add( row[1] )
f1.close()
writer.close()
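The two symptoms share one cause: `close()` belongs to the underlying file object, not to the `csv.reader`/`csv.writer` wrappers, so the files are never closed, the output buffer is never flushed (hence the blank results2.csv), and Windows keeps the file locked. A minimal corrected sketch of the same dedup-on-column-1 logic, written in modern Python 3 form with `with` blocks so the files close automatically:

```python
import csv

def dedupe_csv(src, dst):
    """Copy rows from src to dst, keeping only the first row
    seen for each distinct tweet text (column index 1)."""
    seen = set()
    # `with` closes both file objects on exit, flushing the writer's
    # buffer and releasing the OS-level lock on the files.
    with open(src, 'r', newline='') as inf, \
         open(dst, 'w', newline='') as outf:
        reader = csv.reader(inf)
        writer = csv.writer(outf)
        for row in reader:
            if row[1] not in seen:
                writer.writerow(row)
                seen.add(row[1])
```

(Under Python 2, which this question predates Python 3's dominance, the equivalent would be opening in `'rb'`/`'wb'` mode and calling `close()` on the file objects returned by `open()`, not on the reader/writer.)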

1 Answer:

Answer 0 (score: 0)

Here is a refactored version:

Edit: unicode, fun stuff - I added a .decode() call in read_csv() and an .encode() call in append_csv(); that should solve your problem (I think - you may need to decide on a string codec).

import tweepy
import csv
from collections import defaultdict
import time

FILE = 'c:/temp/results.csv'
KEYWORDS = ['McDonalds', 'Taco Bell', 'Burger King']
WHERE = "34.085422,-117.900879,500mi"
DELAY = 300  # seconds

def _float(s, err=None):
    try:
        return float(s)
    except ValueError:
        return err

def _str(f, err=""):
    return err if f is None else str(f)

def read_csv(fname=FILE):
    data = defaultdict(dict)
    with open(fname, 'rb') as inf:
        incsv = csv.reader(inf)
        for kw,tw,lat,lon in incsv:
            # added .decode() call to handle saved unicode chars
            data[kw][tw.decode()] = (_float(lat), _float(lon))
    return data

def append_csv(data, fname=FILE):
    with open(fname, "ab") as outf:
        outcsv = csv.writer(outf)
        # added .encode() call to handle saved unicode chars
        outcsv.writerows((kw, tw.encode(), _str(lat), _str(lon))
                         for kw, dat in data.iteritems()
                         for tw, (lat, lon) in dat.iteritems())

def search_twitter(keywords=KEYWORDS, loc=WHERE):
    data = defaultdict(dict)
    for kw in keywords:
        for tweet in tweepy.api.search(q=kw, rpp=1000, page=2, geocode=loc):
            # tweet.geo is a dict like {'coordinates': [lat, lon]};
            # store a (lat, lon) tuple so show_data() can unpack it
            if tweet.geo:
                lat, lon = tweet.geo['coordinates']
            else:
                lat, lon = None, None
            data[kw][tweet.text] = (lat, lon)
    return data

def calc_update(old_data, new_data):
    diff = defaultdict(dict)
    for kw,dat in new_data.iteritems():
        for tw,loc in dat.iteritems():
            if tw not in old_data[kw]:
                diff[kw][tw] = old_data[kw][tw] = loc
    return old_data, diff

def show_data(data):
    for kw,dat in data.iteritems():
        for tw,(lat,lon) in dat.iteritems():
            print("<{},{}> {} [{}]".format(_str(lat,"???"), _str(lon,"???"), tw, kw))

def main():
    data = read_csv()
    while True:
        new_data  = search_twitter()
        data,diff = calc_update(data, new_data)
        append_csv(diff)
        show_data(diff)
        time.sleep(DELAY)

if __name__=="__main__":
    main()
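The deduplication in calc_update() is the heart of the refactor: only keyword/tweet pairs not already present in the stored data make it into the diff that gets appended to the CSV. The logic can be exercised in isolation with hypothetical sample data (shown with .items() so the sketch also runs under Python 3):

```python
from collections import defaultdict

def calc_update(old_data, new_data):
    # Merge new_data into old_data; return (old_data, diff) where
    # diff holds only the keyword/tweet pairs not previously stored.
    diff = defaultdict(dict)
    for kw, dat in new_data.items():
        for tw, loc in dat.items():
            if tw not in old_data[kw]:
                diff[kw][tw] = old_data[kw][tw] = loc
    return old_data, diff

# Hypothetical sample data: one tweet already stored, one new.
old = defaultdict(dict, {'McDonalds': {'old tweet': (34.0, -117.9)}})
new = {'McDonalds': {'old tweet': (34.0, -117.9),
                     'new tweet': (34.1, -117.8)}}
old, diff = calc_update(old, new)
# diff now holds only 'new tweet'; old holds both entries,
# so the next append_csv(diff) writes no duplicates.
```

Because old_data is a defaultdict(dict), looking up a keyword that has never been seen before simply yields an empty dict instead of raising KeyError, which is what lets new keywords flow through without special-casing.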