Keep the duplicate item with the lowest price in a CSV file

Time: 2015-02-05 14:06:10

Tags: python

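I'm new to Python. I need to read a CSV file and, wherever an item appears more than once, keep only the entry with the lowest price. For example: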

Input file:

name, link, price, category
item1, http://example.com/item1, 29.30, cat1
item2, http://example.com/item2, 22, cat2
item1, http://example.com/item1, 19.90, cat1

Output file:

name, link, price, category
item2, http://example.com/item2, 22, cat2
item1, http://example.com/item1, 19.90, cat1

Here is my code so far:

    f1 = csv.reader(open('input.csv', 'rb'), delimiter=',')
    writer = csv.writer(open("output.csv", "wb"))
    name = set()
    for row in f1:
        if row[0].lower() not in (i.lower() for i in name):
            writer.writerow(row)
            name.add(row[0])

This code removes the duplicates, but I need help keeping the entry with the lowest price.

Thanks!

4 answers:

Answer 0: (score: 1)

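You can use a dict keyed by name. For each row, compare its price with the price already stored for that name, treating a name that has not been seen yet as having a price of inf, and update the entry when the new price is lower. Finally, write the stored rows out with writerow. If needed, collections.OrderedDict preserves the order in which the names first appear in the file.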

import csv
from collections import OrderedDict

d = OrderedDict() # keep the order

with open('in.csv', 'r') as f1, open("output.csv", "w") as out:
    r = csv.reader(f1,delimiter=",")
    header = next(r) # store header
    writer = csv.writer(out,delimiter=",")
    for row in r:
        price = float(row[2])
        # store the row if the name is new or its price is lower than the stored one
        if row[0] not in d or float(d[row[0]][2]) > price:
            d[row[0]] = row
    writer.writerow(header) # write header
    for tup in d.values(): # write updated items
        writer.writerow(tup)

Output:

name, link, price, category
item1, http://example.com/item1, 19.90, cat1
item2, http://example.com/item2, 22, cat2
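The asker's original code compared names case-insensitively. As a rough sketch of the same dict-based idea that keeps that behaviour, something like the following could work. The function name keep_lowest, its parameters, and the in-memory sample are assumptions made for illustration and are not part of the answer above; it also assumes Python 3.7+, where a plain dict preserves insertion order.

```python
import csv
import io


def keep_lowest(rows, name_col=0, price_col=2):
    """Return one row per name (case-insensitive), keeping the lowest price.

    Rows come back in the order in which each name was first seen.
    """
    best = {}  # lower-cased name -> cheapest row seen so far
    for row in rows:
        key = row[name_col].strip().lower()
        if key not in best or float(row[price_col]) < float(best[key][price_col]):
            best[key] = row
    return list(best.values())


# Hypothetical sample mirroring the question, with one name in different case.
sample = """name, link, price, category
item1, http://example.com/item1, 29.30, cat1
item2, http://example.com/item2, 22, cat2
Item1, http://example.com/item1, 19.90, cat1
"""

# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)
result = keep_lowest(reader)
```

Here "item1" and "Item1" are treated as the same item, so only the 19.90 row survives for it, and it stays in the position where the name first appeared.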

If order doesn't matter, use defaultdict and min:

import csv
from collections import defaultdict

d = defaultdict(list) # name -> list of all rows with that name
with open('in.csv', 'r') as f1, open("output.csv", "w") as out:
    r = csv.reader(f1,delimiter=",")
    header = next(r) # store header
    writer = csv.writer(out,delimiter=",")
    for row in r:
        d[row[0]].append(row)
    writer.writerow(header) # write header
    for k,v in d.items(): # write updated items
        writer.writerow(min(v,key=lambda x:float(x[2])))

Answer 1: (score: 0)

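You can shorten the for loop in the solution provided by mu by using dict.setdefault. If a key does not exist, dict.setdefault sets a value for it; otherwise it leaves the existing value unchanged. Either way, it returns the value now stored for the key.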

for row in f1:
    a = names.setdefault(row[0],row[1])
    if row[1]<a:
        names[row[0]] = row[1]
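To make the setdefault idea concrete for the four-column layout in the question, here is a minimal sketch. It assumes the price sits in the third column (row[2]) and converts it to float so that prices are compared as numbers; the function name lowest_prices and the sample rows are hypothetical.

```python
def lowest_prices(rows):
    """Map each name to its lowest price using dict.setdefault."""
    names = {}
    for row in rows:
        price = float(row[2])
        # setdefault stores price if the name is new, then returns the stored value
        current = names.setdefault(row[0], price)
        if price < current:
            names[row[0]] = price
    return names


# Hypothetical rows, already split into fields.
rows = [
    ["item1", "http://example.com/item1", "29.30", "cat1"],
    ["item2", "http://example.com/item2", "22", "cat2"],
    ["item1", "http://example.com/item1", "19.90", "cat1"],
]
prices = lowest_prices(rows)
```

Note that this version only tracks the lowest price per name. To write complete rows back out, the dict would need to store the whole row rather than just the price.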

Answer 2: (score: 0)

This is trivial in pandas:

import pandas as pd

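df = pd.read_csv('in_csv', skipinitialspace=True) # strip the space after each comma
# for each name, select the whole row that has the lowest price
df.loc[df.groupby('name')['price'].idxmin()]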

Answer 3: (score: 0)

The csv file columns are: name, link, price, category

import itertools, operator
data = list()
new_data = list()
name = operator.itemgetter(0)
# sort key: name first, then price compared as a number rather than as a string
name_price = lambda row: (row[0], float(row[2]))

Separate the header from the data.

with open('data.txt') as f:
    header = next(f)
    for line in f:
        data.append(line.strip().split(','))

data is a list of lists - [[name, link, price, category], ...]

Sort data by name first and by price second.

data.sort(key = name_price)

Use itertools.groupby to group by name, take the first item from each group, format it, and save it to a new list.

for key, group in itertools.groupby(data, name):
    # the first item in the group has the lowest price
    lowest_price = list(group)[0]
    lowest_price = ','.join(lowest_price) + '\n'
    new_data.append(lowest_price)

Write header and new_data to a file.

with open('new_data.txt', 'w') as f:
    f.write(header)
    f.writelines(new_data)

Edited to account for more fields.
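The sort-then-group steps can be condensed into a short sketch like the one below. It is only an illustration under a few assumptions: rows are already split into fields, the price is in the third column, and prices are converted to float so that, for example, 9.50 sorts before 19.90. The function name lowest_per_name and the sample rows are hypothetical.

```python
import itertools
import operator


def lowest_per_name(rows):
    """Sort by (name, numeric price), then keep the first row of each name group."""
    name = operator.itemgetter(0)
    ordered = sorted(rows, key=lambda row: (row[0], float(row[2])))
    # groupby only merges adjacent rows, which is why the sort has to come first
    return [next(group) for _, group in itertools.groupby(ordered, key=name)]


# Hypothetical rows; the 9.50 entry would sort after 19.90 if compared as text.
rows = [
    ["item1", "http://example.com/item1", "29.30", "cat1"],
    ["item2", "http://example.com/item2", "22", "cat2"],
    ["item1", "http://example.com/item1", "19.90", "cat1"],
    ["item1", "http://example.com/item1", "9.50", "cat1"],
]
cheapest = lowest_per_name(rows)
```

Because the rows are sorted, the output is ordered by name rather than by position in the input file.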