Question

I have Python code that works correctly, but it is very slow for a large liste (more than 100,000 entries), and I need to optimize it.

My list contains some duplicate values, and I want to merge those duplicates into a single entry (for example, by summing the quantity).

Example source list (the first field is the transaction ID):

id,quantity
1,10 <--
1,20 <--
2,25
3,30

What I want:

id,quantity
1,30 <--
2,25
3,30

The current code is based on for loops. haversine is a previously declared function (it computes a distance and is not important for this question):

years = ['2018','2017','2016','2015','2014']

for year in years:
    print(year)
    try:
        with open('/home/' + year + '/' + cod + '.csv', encoding='utf-8') as csvfile:
            data = csv.DictReader(csvfile)
            lines = [x for x in data]
            for row in lines[::-1]:
                try:
                    x=float(row['latitude'])
                    y=float(row['longitude'])
                    if(math.isnan(x) == False and math.isnan(y) == False):
                        haversine2 = round((haversine(lon1, lat1, float(row['longitude']), float(row['latitude'])))*1000)
                        z=float(haversine2)
                        if(math.isnan(z) == False):
                            if not liste:
                                liste.append([haversine2,row['latitude'],row['longitude'],quantity])

                            else:
                                for idx,sublist in enumerate(liste):
                                    if sublist[2] == id_mut:
                                      liste[idx][3] = sum(filter(None, [liste[idx][3],quantity]))
                                      doublon = 'ok'
                                      break
                                    else:
                                        doublon = 'nok'
                                if doublon != 'ok':
                                  liste.append([haversine2,row['latitude'],row['longitude'],quantity])
                except Exception as e:
                    print("Error => : ", str(e))
    except Exception as e:
        print("Error => : ", str(e))

Update:

In the end, @Chris pointed me to the pandas df.groupby function, which helped me cut the run time by a factor of 53!
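
The exact code behind that speed-up is not shown here. As a rough sketch only, a groupby-based aggregation could look like the following, assuming the data has the id and quantity columns from the example above (the inline sample string stands in for a real CSV file):

```python
import io

import pandas as pd

# Sample data copied from the example in the question; a real run would
# pass a file path to pd.read_csv instead of this in-memory string.
csv_text = "id,quantity\n1,10\n1,20\n2,25\n3,30\n"
df = pd.read_csv(io.StringIO(csv_text))

# Group the rows that share an id and add up their quantities.
# as_index=False keeps id as a regular column in the result.
totals = df.groupby("id", as_index=False)["quantity"].sum()
print(totals)
```

The grouping is done inside pandas rather than in a Python-level nested loop, which is where most of the time in the original code goes.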

Answer 1

collections.defaultdict may help you here.

For example:

import csv
from collections import defaultdict

result = defaultdict(int)

with open(filename) as csvfile:
    data = csv.DictReader(csvfile)
    for row in data:                                   #Iterate each row
        result[row["id"]] += int(row["quantity"])      #groupby and increment. 
print(result)

Output:

defaultdict(<type 'int'>, {'1': 30, '3': 30, '2': 25})
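
The same idea can probably be carried over to the loop in the question, where each entry holds more than a quantity. The sketch below is an assumption about the intended logic, not the asker's actual code: the helper name merge_rows, the row keys (id, distance) and the sample values are all made up for illustration. A dict keyed by the ID replaces the inner scan over liste, so each row costs one lookup instead of a pass over everything collected so far.

```python
def merge_rows(rows):
    """Merge rows that share an id by summing their quantity.

    Hypothetical helper: the row keys used here are assumptions.
    """
    merged = {}  # id -> [distance, latitude, longitude, quantity]
    for row in rows:
        key = row["id"]
        qty = row["quantity"] or 0  # treat a missing quantity as 0
        if key in merged:
            # Duplicate id: only the quantity is updated.
            merged[key][3] += qty
        else:
            merged[key] = [row["distance"], row["latitude"], row["longitude"], qty]
    # Dicts keep insertion order (Python 3.7+), so first-seen order is preserved.
    return list(merged.values())


# Made-up sample rows, for illustration only.
rows = [
    {"id": "1", "distance": 120, "latitude": "48.85", "longitude": "2.35", "quantity": 10},
    {"id": "1", "distance": 120, "latitude": "48.85", "longitude": "2.35", "quantity": 20},
    {"id": "2", "distance": 300, "latitude": "48.86", "longitude": "2.36", "quantity": 25},
    {"id": "3", "distance": 450, "latitude": "48.87", "longitude": "2.37", "quantity": 30},
]
liste = merge_rows(rows)
print(liste)
```

With 100,000 rows, the nested loop does on the order of n² comparisons in the worst case, while the dict version does roughly n lookups.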
