Question

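I have written a Python program that has to process large data sets for a machine learning task. I have a train set (about 6 million rows) and a test set (about 2 million rows). So far my program runs in a reasonable amount of time, until I get to the last part of my code. My machine learning algorithm makes its predictions, and I save those predictions in a list. But before I write the predictions to a file, I need to do one more thing. My train and test sets contain duplicates. I need to find those duplicates in the train set and extract their corresponding labels. To achieve this, I created a dictionary with my training examples as keys and my labels as values. Then I create a new list and iterate over my test set and train set. If an example from my test set can be found in my train set, I append the corresponding label to my new list; otherwise, I append my prediction to it.

The actual code I use to achieve this: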

from itertools import izip  # Python 2

listed_predictions = list(predictions)

# creating a dictionary: training example -> label
train_dict = dict(izip(train, labels))

result = []
for sample in xrange(len(listed_predictions)):
    if test[sample] in train_dict.keys():
        result.append(train_dict[test[sample]])
    else:
        result.append(predictions[sample])

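This loop takes about 2 million iterations. I thought about numpy arrays, since they should scale better than Python lists, but I have no idea how to achieve the same thing with numpy arrays. I also thought about other optimization solutions like Cython, but before I dive deeper into that, I am hoping there is some low-hanging fruit that I, as an inexperienced programmer with no formal computing education, don't see.

Update: I have implemented thefourtheye's solution, and it brought my runtime down to about 10 hours, which is fast enough for what I want to achieve. Thanks, everybody, for your help and suggestions.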

Answer 1

Two suggestions:

To check whether a key is in a dict, simply use in with the dict object itself (this happens in O(1) on average)
```
if key in dict:
```
Use comprehensions wherever possible.

So your code becomes:

result = [train_dict.get(test[sample], predictions[sample]) for sample in xrange(len(listed_predictions))]
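
For readers on Python 3, where xrange and izip no longer exist, the same idea can be sketched by pairing each test sample with its prediction via zip. The toy data below is made up purely for illustration; only the names train_dict, test and predictions come from the question.

```python
# Toy stand-ins for the question's data (hypothetical values).
train = ["a", "b", "c"]
labels = [1, 0, 1]
test = ["b", "x", "c", "y"]
predictions = [9, 8, 7, 6]

# Map each training sample to its known label.
train_dict = dict(zip(train, labels))

# Use the known label when the test sample also occurs in the training set,
# otherwise fall back to the model's prediction.
result = [train_dict.get(t, p) for t, p in zip(test, predictions)]

print(result)  # [0, 8, 1, 6]
```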

Answer 2

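test[sample] in train_dict.keys() is extremely inefficient. It iterates over all the keys of train_dict looking for the value, when the whole point of a dictionary is fast key lookup.

Use test[sample] in train_dict instead. That change alone might solve your performance problem.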
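
A rough way to see the difference is to time both membership tests. This is a sketch in Python 3, where keys() returns a view with fast lookups, so list(d) is used here to mimic the list that Python 2's keys() builds. The sizes and names are arbitrary.

```python
import timeit

d = {i: i for i in range(200_000)}
keys_as_list = list(d)   # what Python 2's d.keys() would hand back
missing = -1             # worst case for the list: every element is checked

# Both tests give the same answer...
assert (missing in keys_as_list) == (missing in d)

# ...but the list is scanned linearly while the dict does a hash lookup.
t_list = timeit.timeit(lambda: missing in keys_as_list, number=20)
t_dict = timeit.timeit(lambda: missing in d, number=20)
print("list scan: %.6fs, dict lookup: %.6fs" % (t_list, t_dict))
```

Exact timings depend on the machine, but the list scan grows with the number of keys while the dict lookup stays roughly constant.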

Also, do you actually need the results as a list? It may or may not help performance if you simply avoid creating that 2-million-entry list. How about:

def results(sample):
    item = test[sample]
    return train_dict[item] if item in train_dict else predictions[sample]
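
If the merged values are only going to be written out once, a generator can feed them to the file without ever materialising the list. A minimal sketch in Python 3, with made-up data and an in-memory io.StringIO standing in for the real output file (the file name in the comment is hypothetical):

```python
import io

train_dict = {"a": 1, "b": 0}          # hypothetical sample -> label
test = ["b", "x", "a"]
predictions = [9, 8, 7]

def merged(test, predictions, train_dict):
    # Yield one value at a time instead of building a 2-million-entry list.
    for item, prediction in zip(test, predictions):
        yield train_dict[item] if item in train_dict else prediction

out = io.StringIO()                    # stand-in for open("predictions.txt", "w")
out.writelines("%s\n" % value for value in merged(test, predictions, train_dict))
print(out.getvalue())
```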

Some things to compare for performance:

def results(sample):
    # advantage - only looks up the key once
    # disadvantage - accesses `predictions` whether needed or not,
    # so could be cache inefficient
    return train_dict.get(test[sample], predictions[sample])

We could try to get both advantages:

def results(sample):
    # disadvantage - goes wrong if train_dict contains any value that's false
    return train_dict.get(test[sample]) or predictions[sample]

def results(sample):
    # disadvantage - goes wrong if train_dict contains any None value
    value = train_dict.get(test[sample])
    return predictions[sample] if value is None else value

def results(sample):
    # disadvantage - exception might be slow, and might be the common case
    try:
        return train_dict[test[sample]]
    except KeyError:
        return predictions[sample]

default_value = object()
def results(sample):
    # disadvantage - kind of obscure
    value = train_dict.get(test[sample], default_value)
    return predictions[sample] if value is default_value else value

Of course, all of these functions assume that test and predictions remain unchanged for as long as you use the results function.
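
The caveats in the comments above are easy to check on toy data. The sketch below (Python 3, hypothetical values) uses a training label of 0, which is falsy, to show where the or-based variant silently returns the prediction instead of the label while the sentinel variant does not.

```python
train_dict = {"a": 0, "b": 1}      # note the falsy label 0 for "a"
test = ["a", "b", "x"]
predictions = [5, 6, 7]

def with_or(sample):
    # Treats any falsy label (0, "", False, None) as "not found".
    return train_dict.get(test[sample]) or predictions[sample]

_missing = object()                # unique sentinel, can never be a real label

def with_sentinel(sample):
    value = train_dict.get(test[sample], _missing)
    return predictions[sample] if value is _missing else value

print([with_or(i) for i in range(len(test))])        # [5, 1, 7]
print([with_sentinel(i) for i in range(len(test))])  # [0, 1, 7]
```

With binary labels such as 0 and 1, this makes the or-based version a risky choice.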

Answer 3

Not sure whether this will give you a performance boost, but I guess you could give it a try:

def look_up( x ):
    try:
        return train_dict[ test[ x ] ]
    except KeyError:
        return predictions[ x ]

result = map ( look_up, xrange( len( listed_predictions ) ) )
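
Note that in Python 3 map returns a lazy iterator rather than a list, so the same approach needs an explicit list() if a list is wanted. A small sketch with invented data:

```python
train_dict = {"a": 1, "b": 0}      # hypothetical sample -> label
test = ["b", "x", "a"]
predictions = [9, 8, 7]

def look_up(x):
    # EAFP style: try the dictionary first, fall back to the prediction.
    try:
        return train_dict[test[x]]
    except KeyError:
        return predictions[x]

lazy = map(look_up, range(len(test)))   # nothing has been computed yet
result = list(lazy)                     # forces evaluation
print(result)  # [0, 8, 1]
```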

Answer 4

In Python 2.7, assuming that you can form dictionaries of the training and test samples as:

dict1 = dict(izip(train_samples, labels))
dict2 = dict(izip(test_samples, predictions))

then:

result = dict(dict2.items() + [(k,v) for k,v in dict1.viewitems() if k in dict2])

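gives you a dictionary that always uses the known labels from the training set, but restricted to the samples in the test set. You can turn it back into a list if you need to.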
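
In Python 3, dict.items() returns a view that cannot be concatenated with a list using +, and izip and viewitems are gone, so the same merge would have to be written differently. One possible sketch, using made-up data. Note that a dictionary holds each test sample only once, so the list is rebuilt by looking every test sample up again:

```python
train_samples = ["a", "b", "c"]
labels = [1, 0, 1]
test_samples = ["b", "x", "b", "c"]     # "b" appears twice on purpose
predictions = [9, 8, 9, 7]

dict1 = dict(zip(train_samples, labels))
dict2 = dict(zip(test_samples, predictions))

# Start from the predictions, then overwrite with known training labels,
# restricted to keys that actually occur in the test set.
result = dict(dict2)
result.update((k, v) for k, v in dict1.items() if k in dict2)

# Back to a list aligned with the original test order.
as_list = [result[s] for s in test_samples]
print(as_list)  # [0, 8, 0, 1]
```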

Alternatively, use a Series from pandas, or numpy with where and unique.
