Question

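The idea of my code is that when the same user appears more than once with the same device_id, it updates the list (in my case, creates a new list) and removes the duplicate entries. It also takes the last id1, id2 and id3 from the duplicates, puts them into a single item of the new list, and updates the type with the types of the duplicates.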

To explain this, I have provided an example with 4 lists (printed before and after the list is updated).

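My code works, but I have another list containing about 800,000 lists. I tried running the code on it and it ran for an hour. How can I solve this better? (I cannot change the input type, because it comes from another API call; I can only change the code that removes the duplicates.)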

my_list = []
#   [device_id, location, type, name, ph, addr, id1, id2, id3]
val1=  ['12345653', 'SOUTH', 'Broadband', 'Mr Glasses', '+123344', 'MY ADDRESS', '880@myemail', '', '']
val2=  ['12345653', 'SOUTH', 'IPTV', 'Mr Glasses', '+123344', 'MY ADDRES', '', '999@myemail', '']
val3=  ['98102344', 'SOUTH', 'Voice', 'Ms Chair', '+99123123', 'Corner Street Behind Door', '', '', '990@securemail']
val4=  ['11023424', 'SOUTH', 'IPTV', 'Mr Tree', '+125324', 'Upwards error 123', '', '47@securemail', '']


my_list.append(val1)
my_list.append(val2)
my_list.append(val3)
my_list.append(val4)

for x in my_list:
    print x

print 'start removing duplication'
print ''
def rm_dupl(my_list):
    fin_list = []
    dev_exist = []

    for x in my_list:
        dev_id = x[0]
        if dev_id in dev_exist:
            # if entry exist, we just update the existing entry with 
            # the value of this current x, and not creating a new entry
            for y in fin_list:
                if dev_id in y[0]:
                    # y is retrieved value
                    # below we update with the duplication one
                    if 'Broadband' in x[2]:
                        y[2] += '_Broadband'
                        y[6] = x[6]
                    elif 'IPTV' in x[2]:
                        y[2] += '_IPTV'
                        y[7] = x[7]
                    elif 'Voice' in x[2]:
                        y[2] += '_Voice'
                        y[8] = x[8]
                else:
                    continue
        else:
            fin_list.append(x)
            dev_exist.append(dev_id)
    return fin_list


updated_list = rm_dupl(my_list)
for x in updated_list:
    print x

Answer 1

If you make dev_exist a set, checking whether a value exists becomes faster. Currently, every value has to be compared against all the values in the dev_exist list to check whether it already exists. A set, however, checks for existence using a hash, which is faster.

That lookup is likely where most of the time is spent.
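A minimal sketch of that change, assuming the row layout from the question (type at index 2, the three ids at indexes 6 to 8). The name rm_dupl_with_set is made up for illustration, and the exact comparison on the id is a choice of this sketch rather than something the answer prescribes. Only the membership check is replaced; finding the row to update is still a linear scan.

```python
def rm_dupl_with_set(rows):
    """Merge rows sharing a device id; membership is checked against a set."""
    merged = []
    seen_ids = set()  # hash-based lookup instead of scanning a list

    for row in rows:
        dev_id = row[0]
        if dev_id in seen_ids:
            # Locating the kept row is still a scan over `merged`.
            for kept in merged:
                if kept[0] == dev_id:
                    if 'Broadband' in row[2]:
                        kept[2] += '_Broadband'
                        kept[6] = row[6]
                    elif 'IPTV' in row[2]:
                        kept[2] += '_IPTV'
                        kept[7] = row[7]
                    elif 'Voice' in row[2]:
                        kept[2] += '_Voice'
                        kept[8] = row[8]
                    break  # ids are unique in `merged`, so stop here
        else:
            merged.append(row)
            seen_ids.add(dev_id)
    return merged
```

Like the original function, this updates the kept rows in place, so the input rows are modified.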

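Edit: when looking up the duplicates, you can also replace the list with a dictionary. Dictionaries also provide a fast membership test with the in operator.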

Answer 2

As others have already mentioned, scanning a list is inefficient: it is O(n), so the larger the list, the longer each lookup takes.
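To get a feel for the difference, here is a rough timing sketch. The ids are generated for the experiment and are not data from the question, and the absolute numbers depend on the machine, so no figures are quoted.

```python
import timeit

# 100,000 made-up device ids, stored both as a list and as a set.
ids_list = [str(n) for n in range(100000)]
ids_set = set(ids_list)
needle = ids_list[-1]  # last element: the worst case for a list scan

# Time 50 membership tests against each container.
list_seconds = timeit.timeit(lambda: needle in ids_list, number=50)
set_seconds = timeit.timeit(lambda: needle in ids_set, number=50)

print('list: %.6f s, set: %.6f s' % (list_seconds, set_seconds))
```

The gap widens as the container grows, because the list test walks every element while the set test hashes the key once.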

There are two list scans here: one implicit (if dev_id in dev_exist) and one explicit (for y in fin_list: if dev_id in y[0]:).

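The solution is to use a dict (or collections.OrderedDict if insertion order matters) to store the deduplicated results, with the "id" as the key and the row as the value. Dict key lookup is O(1) (constant time) and very fast. This dict also replaces the dev_exist list.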
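As a small illustration of the ordering point, the sketch below keys an OrderedDict by device id. The rows are shortened, hypothetical versions of the question's data, and the type merge is reduced to a plain suffix for brevity.

```python
from collections import OrderedDict

# Shortened, hypothetical rows: [device_id, location, type]
rows = [
    ['12345653', 'SOUTH', 'Broadband'],
    ['98102344', 'SOUTH', 'Voice'],
    ['12345653', 'SOUTH', 'IPTV'],
]

by_id = OrderedDict()
for row in rows:
    dev_id = row[0]
    if dev_id in by_id:                # constant-time key lookup
        by_id[dev_id][2] += '_' + row[2]
    else:
        by_id[dev_id] = list(row)      # copy, so the input rows stay untouched

# Keys come back in the order in which each id was first seen.
ordered_ids = list(by_id.keys())
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mainly matters on older interpreters.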

Also, based on your sample data, you may want to replace if 'somestring' in x[i] with if x[i] == 'somestring', which is more accurate ('foo' in 'foobar' returns true, which is probably not what you want) and (slightly) faster, depending on the length of the strings:

def rm_dupl(my_list):
    results = {}  # or `collections.OrderedDict`
    for row in my_list:
        dev_id = row[0]
        prev_row = results.get(dev_id)
        if prev_row:
            # if entry exist, we just update the existing entry with
            # the value of this current row, and not creating a new entry
            # below we update with the duplication one
            val = row[2]  # avoids multiple access to `row[2]`
            if val == 'Broadband':
                prev_row[2] += '_Broadband'
                prev_row[6] = row[6]
            elif val == 'IPTV':
                prev_row[2] += '_IPTV'
                prev_row[7] = row[7]
            elif val == 'Voice':
                prev_row[2] += '_Voice'
                prev_row[8] = row[8]
        else:
            # no matching row found, let's add
            # a new one
            results[dev_id] = row

    # and returns the values
    # NB in py3 you'll want `list(results.values())` instead
    return results.values()
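The same caveat applies to the ids: the question's inner loop tests dev_id in y[0], which is a substring test when both sides are strings. A short sketch of the difference follows; the shorter id is hypothetical, chosen only to trigger the false match.

```python
kept_id = '12345653'   # id taken from the question's sample rows
other_id = '3456'      # hypothetical shorter id, not in the sample data

# `in` on two strings is a substring test, so this is a false match.
substring_match = other_id in kept_id

# `==` only matches when the whole id is identical.
exact_match = other_id == kept_id

# The same applies to the type column once suffixes have been appended.
type_substring = 'IPTV' in 'Broadband_IPTV'
type_exact = 'IPTV' == 'Broadband_IPTV'
```

Here substring_match and type_substring are True while exact_match and type_exact are False, which is why an equality test is the safer choice for both the id and the type.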


Python: improving list processing for larger inputs
