Question

I have a CSV like the following:

SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
SUV500;49,95 €;0;27-03-2019 14:46;supplier2
MZ-76E;71,25 €;0;27-03-2019 14:46;supplier2
SUV500;32,60 €;1;27-03-2019 14:46;supplier3

I am trying to produce a CSV with the following contents as output:

SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1

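For every SKU, I want to keep only the record with the lowest price.

I'm completely lost: how should I do this? With pandas? With classic if statements? With lists and sets?

Any ideas?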

Answer 1

In pandas, you can do the following:

import pandas as pd

df = pd.read_csv('your file', sep=';')  # the sample file is semicolon-delimited

As Andy pointed out below, this returns only the price and SKU columns:

df_reduced= df.groupby('SKU')['price'].min()

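For all columns, you can change the groupby to a list of all the columns you want to keep. Note that grouping by every column creates one group per distinct combination of values, so with this data it does not collapse to a single row per SKU: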

df_reduced= df.groupby(['SKU', 'availability', 'Title', 'Supplier'])['price'].min()
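
Since `price` is read as text such as `21,50 €`, a numeric comparison needs a parsed column first. The following is a minimal sketch of one way to keep the whole cheapest row per SKU, using sorting plus `drop_duplicates` instead of `groupby`. It assumes the sample data from the question (inlined through `io.StringIO` so it runs without a file), and the helper column name `price_num` is made up for illustration.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
SUV500;49,95 €;0;27-03-2019 14:46;supplier2
MZ-76E;71,25 €;0;27-03-2019 14:46;supplier2
SUV500;32,60 €;1;27-03-2019 14:46;supplier3
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), sep=';')

# price_num is a hypothetical helper column: "21,50 €" becomes 21.5.
df['price_num'] = (
    df['price']
    .str.replace('€', '', regex=False)
    .str.strip()
    .str.replace(',', '.', regex=False)
    .astype(float)
)

# Sort by numeric price, then keep the first (cheapest) row of each SKU.
cheapest = (
    df.sort_values('price_num')
      .drop_duplicates(subset='SKU', keep='first')
      .drop(columns='price_num')
)

print(cheapest.to_csv(sep=';', index=False))
```

This keeps the original `price` text untouched in the output and only uses the numeric column for ordering.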

Answer 2

There is no real need to use pandas here. This may not be the optimal solution, but it is one possible solution:

import csv

class Product:
    def __init__(self, sku, price, availability, title, supplier):
        self.sku = sku
        self.price = float(price.replace(',', '.')[:-2]) # allows sorting 
        self.availability = availability
        self.title = title
        self.supplier = supplier

unparsed_products = []

with open('name_of_csv.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=';')
    next(csv_reader) # to skip past header line when parsing.
    for row in csv_reader:
        p = Product(*row)
        unparsed_products.append(p)

suv500_products = [i for i in unparsed_products if i.sku == 'SUV500']
lowest_priced_suv500_product = min(suv500_products, key=lambda x: x.price) # the SUV500 entry with the lowest price
print(lowest_priced_suv500_product.price)
# prints 21.5

By changing the value of X in `if i.sku == X`, you can easily extend this to other products.
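
To avoid repeating the filter once per SKU, the same idea can be generalized with a dictionary keyed by SKU. This is only a sketch under stated assumptions: the sample data is inlined through `io.StringIO`, and the helper names `parse_price` and `cheapest_per_sku` are hypothetical, not part of the answer above.

```python
import csv
import io

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
SUV500;49,95 €;0;27-03-2019 14:46;supplier2
MZ-76E;71,25 €;0;27-03-2019 14:46;supplier2
SUV500;32,60 €;1;27-03-2019 14:46;supplier3
"""


def parse_price(text):
    """Convert a string such as '21,50 €' to the float 21.5."""
    return float(text.replace('€', '').strip().replace(',', '.'))


def cheapest_per_sku(rows):
    """Return {sku: row} holding the lowest-priced row seen for each SKU."""
    best = {}
    for row in rows:
        sku = row['SKU']
        # Keep the row if the SKU is new or this price beats the stored one.
        if sku not in best or parse_price(row['price']) < parse_price(best[sku]['price']):
            best[sku] = row
    return best


reader = csv.DictReader(io.StringIO(CSV_TEXT), delimiter=';')
result = cheapest_per_sku(reader)

for sku, row in result.items():
    print(sku, row['price'], row['Supplier'])
```

A single pass over the rows is enough here, and no SKU has to be named in advance.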

Answer 3

A non-pandas solution that produces the desired output.

Edit: added a csv writer to the solution.

Edit: only accept records that have '1' at row[2].

from collections import defaultdict
import re
from operator import itemgetter
import csv

fin = open('SKU_csv.csv', 'r', encoding="utf8")
csv_reader = csv.reader(fin, delimiter=';')

fout = open('test_out.csv', 'w', newline='', encoding="utf8") # same encoding as the input, so '€' is written safely
csv_writer = csv.writer(fout, delimiter=';')

csv_writer.writerow(next(csv_reader)) # print header

d = defaultdict(list)

for row in csv_reader:
    if int(row[2]) != 1:
        continue
    key = row[0]
    val = row[1].replace(',', '.')
    price = float(re.search(r'\d+\.\d+', val).group(0)) # raw string avoids invalid-escape warnings
    d[key].append([row, price])

fin.close()

for arr in d.values():
    minimum, _ = min(arr, key=itemgetter(1)) # minimum price (at arr idx 1)
    csv_writer.writerow(minimum)

fout.close()


'''
*** test_out.csv contents

SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
'''
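
A variation on the same non-pandas approach is to sort the available rows by (SKU, price) and take the first row of each group with `itertools.groupby`. The sketch below assumes the sample data from the question, inlined through `io.StringIO`, and writes to an in-memory buffer instead of a file. The helper `price_of` is a hypothetical name, and `Decimal` is used here as a design choice to keep currency values exact.

```python
import csv
import io
from decimal import Decimal
from itertools import groupby

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
SUV500;49,95 €;0;27-03-2019 14:46;supplier2
MZ-76E;71,25 €;0;27-03-2019 14:46;supplier2
SUV500;32,60 €;1;27-03-2019 14:46;supplier3
"""


def price_of(row):
    """Parse the price field (index 1), e.g. '21,50 €' -> Decimal('21.50')."""
    return Decimal(row[1].replace('€', '').strip().replace(',', '.'))


reader = csv.reader(io.StringIO(CSV_TEXT), delimiter=';')
header = next(reader)

# Keep only rows whose availability field (index 2) is '1'.
available = [row for row in reader if row[2] == '1']

# Sort by (SKU, price) so the first row of each SKU group is the cheapest.
available.sort(key=lambda row: (row[0], price_of(row)))
cheapest = [next(group) for _, group in groupby(available, key=lambda row: row[0])]

out = io.StringIO()
writer = csv.writer(out, delimiter=';', lineterminator='\n')
writer.writerow(header)
writer.writerows(cheapest)
print(out.getvalue())
```

Because of the sort, the rows come out ordered by SKU rather than in their original file order.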

Answer 4

Edited: revised after an earlier mistaken assumption.

After reading from the csv file:

In [8]: df = pd.read_csv(filename, delimiter=';', encoding='utf-8')

In [9]: df
Out[9]:
          SKU    price  availability             Title   Supplier
0      SUV500  21,50 €             1  27-03-2019 14:46  supplier1
1      MZ-76E   5,50 €             1  27-03-2019 14:46  supplier1
2      SUV500  49,95 €             0  27-03-2019 14:46  supplier2
3      MZ-76E  71,25 €             0  27-03-2019 14:46  supplier2
4      SUV500  32,60 €             1  27-03-2019 14:46  supplier3

Add a new column to hold the float value of price:

In [12]:  df['f_price'] = df['price'].str.extract(r'([+-]?\d+\,\d+)', expand=False).str.replace(',', '.').astype(float)
# Note: astype(float) expects '.' as the decimal separator regardless of locale,
# so the str.replace(',', '.') step above is required.

In [13]: df
Out[13]:
          SKU    price  availability             Title   Supplier  f_price
0      SUV500  21,50 €             1  27-03-2019 14:46  supplier1    21.50
1      MZ-76E   5,50 €             1  27-03-2019 14:46  supplier1     5.50
2      SUV500  49,95 €             0  27-03-2019 14:46  supplier2    49.95
3      MZ-76E  71,25 €             0  27-03-2019 14:46  supplier2    71.25
4      SUV500  32,60 €             1  27-03-2019 14:46  supplier3    32.60

Use groupby to get the list of row indices of the lowest price (f_price) in each group:

In [28]: idxmin_list = df.groupby('SKU')['f_price'].idxmin().tolist()

In [29]: idxmin_list
Out[29]: [1, 0]

Finally, pass idxmin_list to df.loc and drop the f_price column to get the final result:

In [33]: df_final = df.loc[idxmin_list].drop(columns='f_price')

In [34]: df_final
Out[34]:
      SKU    price  availability             Title   Supplier
1  MZ-76E   5,50 €             1  27-03-2019 14:46  supplier1
0  SUV500  21,50 €             1  27-03-2019 14:46  supplier1

Write to a csv file:

In [65]: df_final.to_csv('Sku_min.csv', sep=';', index=False)

The file Sku_min.csv is created in your working folder with the following contents:

SKU;price;availability;Title;Supplier
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
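
The steps above do not look at the `availability` column. If only available items should be considered, as in Answer 3, a filter can be applied before `idxmin`. This is a sketch under stated assumptions: the sample data is inlined through `io.StringIO`, and the variable name `in_stock` is made up for illustration.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """SKU;price;availability;Title;Supplier
SUV500;21,50 €;1;27-03-2019 14:46;supplier1
MZ-76E;5,50 €;1;27-03-2019 14:46;supplier1
SUV500;49,95 €;0;27-03-2019 14:46;supplier2
MZ-76E;71,25 €;0;27-03-2019 14:46;supplier2
SUV500;32,60 €;1;27-03-2019 14:46;supplier3
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), sep=';')

# Keep only available rows; copy so the helper column does not touch df.
in_stock = df[df['availability'] == 1].copy()

# Numeric helper column, same idea as f_price above.
in_stock['f_price'] = (
    in_stock['price']
    .str.extract(r'(\d+,\d+)', expand=False)
    .str.replace(',', '.', regex=False)
    .astype(float)
)

# Index label of the cheapest available row per SKU.
idx = in_stock.groupby('SKU')['f_price'].idxmin()

# Select from the original frame, which never had the helper column.
df_final = df.loc[idx]
print(df_final.to_csv(sep=';', index=False))
```

With the sample data the selected rows are the same as above, since the cheapest row of each SKU happens to be available.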

How to find the minimum of list elements based on unique values of the same list using Python

4 answers: